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Summary 

The work reported here was conducted by a number of students at Cornell 
University and myself. A major contribution was made by Cristina Mahon (previ- 
ously Cristina Moura); her masters thesis is included as Appendix A The work for 
this grant falls into two main categories: algorithms for the MPP and the Parallel 
Pascal Development system. A number of novel algorithms for data arbitrary data 
mappings, permutations, and image rotation including interpolation have been 
developed and implemented on the MPP. A program development system has been 
developed for both the MPP and conventional serial computers. This system greatly 
simplifies the development of high level language programs for the MPP. Further- 
more, it allows programs to be developed and tested on any conventional computer. 
This environment consists of a set of system programs and a library of general pur- 
pose Parallel Pascal functions. 

A detailed description of the data mapping and rotation algorithms for the MPP 
is given in section 3 of Cristina’s thesis which is included as Appendix A These algo- 
rithms have been published [1] and a copy of this paper has been included as Appen- 
dix B. The specification of the Parallel Pascal language has now been published [2, 3] 
These papers are included as Appendices C and D. The documentation for the Parallel 
Pascal Development system is given in Appendix E and a description of using this sys- 
tem on the MPP is given in section 2 of Appendix A 

In addition to the development of the reported algorithms and software this 
grant provided us with the opportunity to be the pioneer remote users of the MPP. 
Most of the work on the MPP was done from Cornell over a telephone line which in 
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itself absorbed a significant amount of the available funds. We also made several 
visits to NASA; on two occasions seminars were presented on the Parallel Pascal en- 
vironment. Considerable discussions were held wiht NASA on the development of 
the I/O system for the MPP and other aspects of the high level language environment 
The complete Parallel Pascal development system has been installed on the MPP host 
and has been distributed by us to a number of remote sites. It has been ported to a 
number of conventional computers including VAX systems running either VMS or 
UNIX. 

Some of the highlights of the results of this research are listed below. 

Image Processing Algorithms 

The following algorithms are described in detail in section 3 of Appendix A. 

Data Mapping 

A fast heuristic arbitrary data mapping algorithm has been developed. For most 
mappings this is much faster than other techniques such as sorting. This has 
been implemented for both regular (128 x 128) and large ((n * 128) x (m * 128) 
two dimensional arrays. 

Matrix Rotation 

Fast matrix rotation algorithms have been implemented based on the above data 
mapping function. The nearest neighbor algorithm ahs been tested on the MPP. 
Large matrix nearest neighbor rotation, and interpolation schemes have been 
developed and tested on conventional computers but have not yet run on the 
MPP. 

Interpolation 

A high speed interpolation algorithm has been implemented for bilinear and bi- 
cubic interpolation for image rotations (any angie) and smaii matrix warps. 
These algorithms work on the development system but have not yet been tested 
on the MPP. 

The Parallel Pascal Development System 
Compiler Command 

A command file has been written both for the MPP and the development system 
which, with a simple noninteractive command, compiles a program, makes all 
the library links, and in the case of the MPP, loads the program onto the system. 
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See section 2 of Appendix A for details. 

Library Preprocessor 

Standard Pascal does not have any library facilities. A general purpose library 
preprocessor has been developed which works for both the MPP compiler and 
the development system. See Appendices D and E for details. The preprocessor 
looks for library subprograms first in named files, then in the local directory, and 
finally in a system library directory. The MPP compiler version of the prepro- 
cessor also examines a special MPP system library before the general system li- 
brary. This library contains system library programs which have been modified 
to overcome deficiencies in the MPP compiler. This library preprocessor can be 
used in conjunction with the assembly language library feature which is built 
into the MPP Parallel Pascal compiler. It is also able to work in conjunction 
with any library facilities that are available with a local Pascal compiler that is 
used with the development system. 

Parallel Pascal Translator 

The translator is the heart ot the development system. It translates a Parallel 
Pascal program into standard Pascal for execution on a conventional serial com- 
puter; See Appendix E for details. The translator is a Pascal program with over 
8000 lines of code. It has been in a very stable form for over a year now. It 
still has some limitations but these are now well documented. In addition to be- 
ing used by this and other MPP research groups it has also been used in a Paral- 
lel Processing course which has now been offered three times with an enrollment 
of about 50 students each time. 

The System Library 

A set of general purpose library programs have been developed. All of these pro- 
grams run correctly on the development system and nearly all of them have been 
tested on the MPP. Documentation for these programs is given in Appendix E. 

General Utilities 

Programs in the general utilities group include a parallel random number gen- 
erator, an index generator, simplified I/O functions and a parallel ceiling func- 
tion. 

Masked Reduction Functions 

It is frequently necessary to apply a reduction function to a subregion of an ar- 
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ray. The masked reduction library functions are similar to the primitive reduc- 
tion functions except that a Boolean array mask parameter is required. 

Large Array Utilities 

The large array utilities are shift and rotate functions that operate on matrices 
which have dimensions that are multiples of 128. Both the crinkled and blocked 
data structures are supported. There are also shift and rotate functions which 
treat a 128 x 128 array as a vector of 16384 elements. These functions do not 
use the hardware spiral interconnections; the use of the regular mesh intercon- 
nections is faster for multiple element shifts. 

Near Neighbor Convolution Functions 

Many low level image processing applications require convolutions between im- 
ages and small kernel matrices. Programs in this group simplify the entry of 
small matrices and the application of these matrices to image arrays. 

Pyramid Operations 

A pyramid convolution function has been implemented; this is a three dimen- 
sional convolution function which operates on the 13 near neighbors of a pyram- 
id data structure. This data structure is embedded within an MPP array. Func- 
tions are also available for both the vertical and horizontal data shift operations 
that are associated with pyramid algorithms. 
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ABSTRACT 


The Massively Parallel Processor (MPP) is a SIMD computer with. 16384 processing ele- 
ments connected in a 128 x 128 mesh- The MPP is programmed in a high level language 
called Parallel Pascal, a superset of standard Pascal used to p r o g ra m parallel computers. This 
thesis describes system and programming tools for the Parallel Pascal development system and 
the MPP-compiler system for the MPP. Some of these tools were developed to help remote 
users, while others were developed to give users a way to work, around features not yet 
implemented on the MPP. 

The organization of the MPP is ideal for problems which involve near neighbor interac- 
tions. In this thesis we present a general algorithm for implementing arbitrary permutations 
and mappings on such systems as well as matrix rotation schemes for nearest neighbor, bil- 
inear interpolation and bicubic spline interpolation mappings: An algorithm for the Ising 
model and its implementation on the MPP using Monte Carlo techniques is also discussed. 
Such an application is well suited for the MPP because it uses mainly near neighbor interac- 
tions and Boolean operations. Results and timings obtained from a direct implementation of 
those algorithms on the MPP are also reported. 
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CHAPTER 1 




INTRODUCTION 




This thesis is concerned -with programming tools and parallel algorithms created for the 
Massively Parallel Processor (MPP) located at NASA Goddard Space Center. The goals of this 
thesis are to create a user-friendly environment for high-level language parallel algori thm 
development, as well as research the issues involved in implementing certain algorithms on 
the MPP and compare the expected results with the achieved results. 

Digital computers as we know them today have always had a certain degree of parallel- 
ism, first for reliability purposes, and later also for greater performance. Parallelism, for the 
purpose of increased pe rform ance, can be classified into the following types, [l]. 

- serial by bit 

- serial by character 

-"parallel word" but serial instruction 

- parallel instruction execution/access 

- parallel instruction execution units 

- parallel instructions (SOMD) 

- parallel instructions (MIMD) 

The term parallel processing is usually used to refer to the last two types of parallelism 
listed above, since those are the types that allow parallel work on more than one set of data at 
any single time. .An SEMD (single-instruction stream, multiple-data streams) computer is com- 
posed of several processors which perform the same operation on different pieces of data. An 
MIMD. (multiple-instruction stre ams, multiple-data streams) computer, on the other hand, is 
composed of several processors performing different operations on different pieces of data. 
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These two approaches underlie different purposes. S3MD computers are lockstep process- 
ing type computers, while MIMD computers are asynchronous processing type. That 
difference tends to make SIMD computers more specialized than MIMD, since they do not 
allow for independence between their processors. The Massively Parallel Processor (MPP) pro- 
duced by Goodyear Aerospace for NASA Goddard Space Center belongs to the SIMD class of 
processors. It was developed for image processing of satellite data. 

1.1. The MPP Architecture 

The block diagram of the MPP is shown in Figure 1.1. It consists of the following five 
main components. The Array Unit (ARU) processes two dimensional arrays of data. It is con- 
trolled by the PE Control Unit which executes parts of the user p ro g ram that contain array 
operations: The Staging Memory handles the array I/O by both storing and rearranging array 
data. The Main Control Unit (MCU) executes the application program, performs scalar opera- 
tions and calls on the PE control unit to perform the array operations of the application pro- 
gram. This is the main interface unit to the host machine. Finally, the host computer, a VAX 
11/780, serves as the communication unit between the user and the MPP. [2]. 

1.1.1 . The Array Unit : 

The component that makes the MPP special is the ARU. Logically, it contains 16384 bit 
processing elements (PFs) organized as a 128 by 128 square and operating at a basic cycle of 
100 nsec. Physically, it contains an extra 128 by 4 rectangle of PFs that is used for reliability 
purposes. When a PE fault is detected the ARU can be reconfigured by replacing the 128 by 4 
rectangle containing the faulty PE by the extra rectangle. Each PE supports boolean and 
arithmetic operations, is maskable and is capable of routing data to its orthogonal neighbors. 
The speed of some typical operations can be seen in table 1.1. [3]. 

The MPP is constructed using a total of 2112 VLSI chips. This total includes the spare 
column of chips for red undan cy, Each chip contains eight PFs configured in a 2 by 4 array. 



FIFO 


4 . 


PE Control 
Unit 




global or function 
control path ■ 

3 data path 


Figure 1.1. MPP Block. Diagram 
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Table 1.1. Speed of Typical Operations 


OPERATIONS 

EXECUTION 

SPEED 

ADDITION OF ARRAYS 


S-BIT INTEGERS (9-81T SUN) 

6553 

12-BIT INTEGERS (13-BIT StBI) 

0428 

32-3IT FLOATING-POINT NUMBERS 

470 

MULTIPLICATION OF ARRAYS 


(ELEMENT-BY-ELEMENT) 


8-BIT INTEGERS (16-aiT PRODUCT) 

1861 

12-BIT INTEGERS (24-BIT PRODUCT) 

910 

32-BIT FLOATING-POINT NUMBERS 

231 

MULTIPLICATION OF ARRAY BY SCALAR 


3-BIT INTEGERS (16-BIT PRODUCT) 

2824 

12-BIT INTEGERS (24-SIT PRODUCT) 

1489 

32-BIT FLOATING-POINT NUMBERS 

37! 


‘MILLION OPERATIONS PER SECOND 


an eight bit bi-directional data port, and a disable circuit that is used to disconnect the chip 
from its east-west neighbors. This last feature is useful when repairing the array using the 
redundant PFs. The chip does not include PE memory so as not to complicate the chip design, 
to allow the MPP to use more memory per PE at a faster access time and to allow future 
expansion of PE memory without chip redesign. 

Each PE contains six single bit registers (AAC,GJ\S), a variable length shift register, a 
full adder and some combinatorial logic. The logic diagram of a PE is shown in Figure 1.2. [3]. 
The PE has four subunits that have independent control but share a co mm on clock. They axe: 
logic and routing, arithmetic, I/O, and masking subunits. They are interconnected by a bi- 
directional data bus which also connects to external PE memory. 

The logic and routing subunit is formed by the P register and some supporting com- 
binatorial logic. Register P can be logically combined with the state of the data bus and the 
result would be stored in register P itself. When the routing is enabled the state of one of the 
P registers in the north, south, east or west neighbor PFs is latched in P. This allows the rout- 
ing of data between neighboring PFs. 













original page ss 

OF POOR QUALITY 



The arithmetic subunit contains a serial-by-bit a d d er formed by registers 3 and C and 
a variable length shift register whose output may be stored in A. The inputs for the adder 
are registers P and A- Register A can be loaded directly from the data bus. Register B stores 
the least significant bit and register C the carry. The variable length shift register is used to 
improve the multiply and divide operation times. 

The I/O subunit is formed by the S register and a two input multiplexor which selects 
the input from either the data bus or the S register of the PE west neighbor. The data enters 
the array coming from the west. 

Finally, the making subunit is formed by the G register which decides which PE’s are 
active. A PE will be active when its G register is enabled. It allows routing and arithmetic 
operations to be masked separately. The ARU, composed of all the PE’s, is controlled by the 


PE control unit. 
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1.1 2 . The PE Control Unit : 

The PE control unit is a microprogrammed control unit with 64-bit words. It performs 
all the basic array arithmetic operations of the application program, leaving no idle ARU 
cycles. [4J. It also determines the connectivity between opposite edges of the ARU through a 
3-bit code held in the topology register. Before the application program starts execution, the 
PE control unit memory is loaded from the host computer. 

The address of the bit-planes to be processed by the PFs is generated by the bit-plane 
address generator in the PE Control Unit and then broadcasted to all PE random-access 
memories in the ARU. The PE Control Unit is connected to the Main Control Unit by a first- 
in-first-out buffer called the call queue. The call queue holds calls to PE control routines 
from the Main Control Unit, which allows the Main Control Unit to work ahead on the 
application pr o gra m without waiting for a call to the PE control unit to complete. 

Scalar information associated with an MCU call is put into a common register from the 
call queue at the beginning of the called routine. At the end of the called routine, the con- 
tents of the common register can be transmitted back to the MCU. The PE control unit, if 
specified, can combine certain common register bits with the control lines by performing a 
logical-OR operation. Since the MCU can initialize the common register when a PE control 
unit routine is entered, this capability allows the main control to specify certain PE operations. 
An example of an application of the logical-OR function is array multiplication. It is useful 
for setting the lengths of the PE shift registers. Without this logical-OR capability, a different 
multiplication routine would be needed for each shift register length. The logical-OR is also 
useful to change the common register when the contents of this register have to be sent back 
to the MCU. 

The PE control unit allows operations like a global OR function to be performed on all 
the elements of the array. Such an operation is extremely useful to the programmer since it 
allows conditions to be determined based on the status of the whole array. 
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1.1. 3. The Staging Memory : 

The staging memory is connected to the I/O ports of the ARU and to a host computer. It 
acts not only as a buffer to the ARU data but it also allows the arrays of data to be reordered. 

Since the ARU ports transfer the data in a bit-sequential order, that is the most (or least) 
significant bit of 16384 elements followed by the next bit of 16384 elements, etc, the reorder- 
ing provided by the staging memory is necessary to organize the data in the normal order of 
satellite ima gery used in the host. 

1.1. 4. The Main Control Unit : 

The Main Control Unit (MCU) is essentially a high-speed 16-bit minicomputer. It con- 
tains 50 16-bit registers. [5]. Mast of those registers are used to communicate with the PE con- 
trol unit and through the I/O control unit to the Staging Memory. The main control memory 
of the MCU is loaded by the host computer before an application program starts execution. It 
holds instructions and scalar data for the MCU. This memory is also shared with the I/O con- 
trol unit for the Staging Memory. It holds I/O programs for the I/O control unit. The MCU 
mmmiinicatM with the PE control uni t through a FIFO buffer which was explained in section 
1 . 1 . 2 . 

1-1-5- The host m aching 

The host computer is a VAX 11/780 running the VAX- VMS operating system. An 
appl ication program is split between the host computer and the MCU; the progr amm er 
specifies which sections of the pr ogr a m are to be run on the MCU. 

On the MCU p rngr a ms have direct high speed access to the ARU, however there are 
y>r n» important limi tations. First the MCU only has 32 K words of memory, second the MCU 
has no access to system peripheral devices other than the MPP and finally the MCU has no 
hardware for scalar floating point ari thm etic- On the VAX there is no restrictions on memory 
space, full access to all peripherals (except direct access to the .ARU) and a floating point 
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accelerator. Currently, the VAX is also shared with other users which can seriously affect 
performance. The programmer must decide the trade-offs in splitting an application program. 
A significant time penalty in transferring control and data through the UNIBUS link which 
connects these two systems must also be considered. 

L2. Outline 

The goal of this work, was to develop algorithms in a high-level language that would 
take advantage of the MPP architecture described above. The high-level language for the 
MPP is Parallel Pascal [6]. which is a superset of standard Pascal for progr amming parallel 
computers: In Parallel Pascal all conventional expressions are extended to array data types, 
that is, arrays can be treated as units. 

The development system used to create Parallel Pascal p ro g r ams is divided in two major 
parts. First, a Parallel Pascal to standard Pascal translator used in conjunction with a library 
preprocessor, both developed under the UNIX operating system, allow for algori thm develop- 
ment and testing on a serial computer. This system was adapted to run also under VAX- VMS, 
the operating system used by the MPP VAX 11/780 host computer. The translator and the 
system created around it allows the initial program development to be done in any serial com- 
puter running UNIX or VMS. After this initial phase, an MPP compiler system running 
tinder VMS is used to prepare the p ro g r a ms for execution on the MPP. This part of the work 
is done on the host machine. 

These systems as well as other tools used for algorithm development are described in 
greater detail in chapter 2. In chapters 3 and 4, algorithms that were implemented on the 
MPP and p e rform ance measurements obtained are discussed. The algorithms include parallel 
imag e rotations and an Tsing model. For these algori thms expected results are compared to 
obtained results, so that conclusions on how well these algorithms performed can be drawn- 



CHAPTER 2 


SYSTEM AND PROGRAMMING TOOLS 


The algorithms implemented in this thesis were written in Parallel Pascal Parallel Pas- 
cal offers efficiency, portability as well as error detection and diagnosis facilities, features 
which when combined create a more pleasant environment than assembly language for pro- 
gram development on the MPP. [6]. 

Parallel Pascal contains three basic extensions to standard Pascal: [l\ 

1) expressions involving whole arrays are perm itted; 

2) the where - do - otherwise control statement is available. This statement ia a parallel 
version of the if - then - else statement; the control expression must evaluate to a boolean 
array. All array assignments within the controlled statements must be conformable with the 
control array and are masked by it. 

3) there are array reduction functions available. These functions reduce arrays according to 
the or, and, minimum, maximum, sum or product operations. 

Programming in Parallel Pascal on the MPP involves using a development system on a 
serial computer to do the preliminary algorithm development work and then, using a compiler 
to check, for the limitations and restrictions of the MPP. The advantages of this two-step 
approach are that the initial work can be done in any serial computer running UNIX or VMS 
and having a Parallel Pascal development system, that is, independently of the MPP, there are 
no compile time restrictions as the ones encountered when using the Parallel Pascal compiler 
for the MPP, and, finally, different parallel array sizes can be used to test some of the algo- 
rithms without being restricted to the 128 by 128 MPP size. This two-step approach is spe- 
cially useful for users that are not in the same location as the MPP. 
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2.1. MPP T .imitations 

Several programs written in standard Parallel Pascal that had successfully passed the 
first phase had to be significantly modified to run on the MPP hardware. Modifications were 
necessary because of the intrinsic limitations of the MPP. Table 2.1 contains a list of the more 
significant limitations. 

One of the limitations of the MPP is that the pr o g r amm er has to explicitly assign pro- 
grams; procedures and functions either to the host or to the MPP. This is done through a com- 
piler switch {$H}. Parts of the program that will run on the host are preceded by a {$H+} 
and parts that will run on the MPP are preceded by a {$H-}. These two instructions may 
only appear before the program, procedure and function statement! This compiler switch 
or target machine option affects the p rogr amming unit to which it is attached and all syntacti- 
cally inner functions or procedures, except when those inner functions or procedures are pre- 
ceded by their own target machine option! 

2JL Software Structure 

The work in this thesis represents the first project on the MPP which: a) involved the 
development of large programs and libraries and b) was conducted by remote access to the sys- 
tem, Le. a 1200 baud telephone line from Cornell. Several system tools had to be developed 
for effective remote use of the MPP. 


Table 2.1. MPP T. imit ations. 

1) The two lower dimensions of a parallel array have to be 128 by 
128. 

2) The local PE memory size is 1024 bit! 

3) Program! procedures and functions have to be assigned to the 
host or the MPP by the programmer. 

4) The local MCU memory size is 65K bytes. 
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22.1 . Development System Tools 

The development system is used to develop and test Parallel Pascal programs in a serial 
environment. It consists of two p mcr amg extern, a Pascal external function preprocessor, and 
ppt, the Parallel Pascal translator. Used in conjunction with them is a set of library functions 
and a standard Pascal compiler. 

For a Parallel Pascal p r o gr a m to run on a serial computer, it has to go through the fol- 
lowing steps (see figure 2.1): 

- the program extern will replace all the calls to external procedures and 
functions by the procedures themselves, which are read from a set of libraries. 

- the output of extern will be translated by ppt, the Parallel Pascal translator. 
Any compiler detected errors will be reported in a listing at this stage. The 
output will be a program in conventional Pascal. 

- the program is compiled by a standard Pascal compiler. 

In order to allow a user to easily translate a program from Parallel Pascal to Pascal, a 
command file was created. By simply typing 
pp Jilenante.pp 

the user's pr ogr a m will be sent through extern and ppt and a listing file as well as a compiled 
and linked pr og ra m win be created. 

The command pp on VMS is the same command used for the development system at 
Cornell University. It was chosen so that all the documentation already available at Cornell 
could be used without modifications at NASA. However, at NASA, outside the account 
REEVES, used to develop this system, another command name must be used since pp invokes 
the local Parallel Pascal compiler used in conjunction with the MPP. It has not yet been 
decided which of the two systems will retain the name pp. The listings of the command file 
for the VMS system can be found in section 6.1 of appendix A. Its corresponding sheil script 
in UNIX can be found in section 62 of appendix A. 
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name.pp 


C preprocessor^N 
extern 


libraries 


translator 

ppt 



-> listing file 


name.p or name.pas 


Pascal compiler*^) 


executable program 


Figure 2.L Steps of Parallel Pascal Development System 

There are two sets of libraries containing Parallel Pascal functions and procedures. One 
of them is defined to be used as part of the development system in conjunction with extern, 
while the other is targeted for use directly on the MPP and is accessed through a modified 
version of extern called mppextem. 

The only difference between extern and mppextem is that mppextem checks for a spe- 
cial MPP library first and, if none is found, uses the standard library. Extern uses only the 
standard library. Both libraries contain the same set of functions; however, the MPP-targeted 
functions were modified to fit some of the limitations and restrictions of the MPP. Section 2.3 
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will describe in detail some known MPP restrictions and tie changes to p ro g r am s written in 
standard Parallel Pascal required by them. 

2.2. 2. MPP-Compiler System 

After an algorithm has been tested on the host using the development system, the pro- 
gram can be adapted to use the MPP for its array computations. The MPP is considered as an 
attached processor to the VAX 11/780. This section will discuss how to compile, assemble, 
link, and execute a Parallel Pascal p ro g r a m on the VAX 11/780 and MPP systems. The 
features added in order to create a more user-friendly environment for those users attempting 
to use the MPP are also described. 

Parallel Pascal programs can execute on the VAX 11/780, the MPP or on a combination 
of both- [Sj. Compiler switches can be used to specify on which system the code will execute; 
however, all main programs begin execution on the host machine. 

The first system tool created implements all the steps needed to compile a p rogram for 
the MPP. To use this tool all that is required is typing the command: 

rmpp filename The listing for the command file can be found in section 6.3 of 

appendix A. 

This command file does library preprocessing using mppextem to substitute all calls to 
external functions by the body of these functions. The new file created has a t appended in 
front of the old filename. A second command file, called repL, has been developed to do just 
the library preprocessing step. The syntax for this command is 

re pi filename. pp 

In rmpp , once the automatic preprocessing is finished the resulting file is compiled, 
assembled and linked according to the list of commands below: 


S PP filename or SPPDEV filename 
% MACRO filename 
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$ MCL filename 
$ MPPLJNK filename 
$ CADLNK filename, flenamestb 

Figure 22 shows a flowchart of the different commands executed and all the different 
flies which are created. Each command includes several qualifiers to further specify which 
actions should be taken by the system. These are provided by the co mman d file. 

The first command: PP filename when used inside the account REEVES would cause 
problems since PP is locally defined to mean the Parallel Pascal translator and not the Parallel 
Pascal compiler, which is what we want in this context. Fortunately, the people in charge of 
the Parallel Pascal compiler suggest the use of the command PPDEV instead of PP when cal- 
ling the Parallel Pascal compiler. PPDEV corresponds to the compiler in the system develop- 
ment directory and which has some of the bugs of PP already fixed. Since we use PPDEV 
each time the compiler is called there is no possible ambiguity. To my knowledge, no one uses 
anymore the command PP when referring to the Parallel Pascal compiler. 

The compiler inputs Parallel Pascal source code and outputs P-code. The code generator 
is invoked by the same command and it uses the P-code to produce MACRO (VAX assembly 
language) and MCL (MCU assembly language) code. The code generator will only execute if 
the compiler has not detected any errors. [8]. 

The files generated by the compiler are: 

- flenameliy. Parallel Pascal listing file 

- fdenaiTve.pcd: P-code file 

The compiler switches to produce the Us and pcd files are default These switches can be 
turned off by including {$L-} and {$C-} respectively in the p rogram. However, in the absence 
of a P-code file the code generator will not be able to run. 

Another of the compiler switches {$H} was mentioned before in section 2.1. It allows 
the p ro g r amm er to specify which parts of the program will run on the MCU and which parts 
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name.pp 




compiler 


code 

generator 


assemblers 


linkers 


executable 

files 


Figure 2J2. Flowchart of MPP-Compiler System 


will run on the host. The switch {SD+} enables symbol debugging information to be passed to 
the assemblers. The default is that is, option disabled. More information on these 
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switches and others can be found in the Parallel Pascal User’s guide. [8]. The code generator, 
which executes after the compiler, produces the following files: 

- filenamejcgb a debug file 

- fiLenamemar. VAX assembly code 

- filenamejnct MPP assembly code. 

After the assembly files are produced, two different assemblers need to be used. One is 
the VAX 11/780 assembler and the other one is the MPP assembler. 

The VAX assembler is called by the command: 

$ MACRO filename or filenameAlAR 

where MAR is the default extension for files to be assembled by the VAX assembler. Its out- 
put is an object file with extension OBJ. 

The MPP assembler is invoked by: 

S MCL filename or filenameJVICL 

where MCL is the default extension for files to be assembled by the MPP assembler and stands 
for Main Control Language. By default it outputs an object file filenameAlOB. An MCL list- 
ing can also be created by using the /LIS qualifier. 

The next step is to link the assembled programs. Two links are necessary. The first one, 
for the MCU-resident code, provides a symbol table that will be used in conjunction with the 
second link. It is invoked by the command MPPLINK. The second link is called CADLNK 
and is used to link together the symbols from the MCU code to the ones from the VAX 
resident code. These two links have to be performed in this order. 

The symbol table created by the MCU linker contains symbol definitions for all global 
symbols in the image. It is created by default if an executable file is created. Its extension is 
STB. This symbol table is used as input to the subsequent VAX linker. The VAX linker is 
called by. 

$ CADLNK filenameCOBJ), filename^TB, (PPDEVRUN/lib) 
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When the VAX linker is invoked, the host-resident object module is linked to the global 
symbols defined by the MPP linker. An extra runtime library can also be linked with the 
p r o gram. For the moment, to use this library the statement added to the CADLNK command 
is PPDEVRUN/lib. 

The MPPLUMK produces two files, filenameJvlME and filenameJMPE. FilenameJVIME and 
filename^lPE contain the executable code for the MCU and the PE array respectively. The 
VAX linker produces filenameJEXE, which contains the executable code for the VAX. 

To execute the program the command 

$ CAD filenameCMME) filenameCMPE) filenameCEXE) 
has to be given. If compilation, assembly and linkage didn’t produce any errors, then that 
command can be typed after successful completion of the rmpp command, which includes all 
the commands described above. 

The CAD command will initiate a CAD (Control and Debug) session. [9]. The Control 
and Debug (CAD) program is an interactive program that allows the user to control the MPP. 
Its debugging portion allows the programmer toe 

a) load MPP programs. 

b) control execution of the p ro gr am. 

c) display data. 

d) detect, locate and patch errors. 

The total program compilation takes a long time. Therefore, it is usually worthwhile to exe- 
cute the compilation in batch mode. This frees the terminal for further work, which is espe- 
cially helpful when all the communication is done over a long distance phone line. 

In order to submit a pr ogram for batch processing all that is required is to create a com- 
mand file that sets the default directory to the directory where the file is located and then 
gives the command to be executed in batch. The general format of this file is 
% sd [ directory ] 

% ! command to be executed 
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Once this hie exists the batch tasic is initialed by typing; 
batchm command _ftle. 


The command batchm is a simplification that can be used ingtaari of the system co mmand 
submpp followed by several qualifiers. It automatically uses the qualifiers best suited for 
someone working over the phone line and therefore without direct access to hard copies of log 
files produced during the batch job. 


2.3. MPP Compiler Restrictions 

Modifications to standard Parallel Pascal programs for them to run on the MPP due to 
the limitations listed in table 2.1 were anticipated. However, we found out that since the sys- 
tem is still being developed, there are also several restrictions to the theoretically possible 
operations. Table 2 2 contains a list of the main compiler restrictions. Substantial additional 
program modifications were necessary and new programming tools had to be developed. 


Table 22. MPP Compiler Restrictions. 

1) The MPP has no high-level I/O, all high-level I/O is done on 
the host. 

2) MPP-host interface restrictions 

- An MPP-resident routine can only call other MPP-resident 
routines. 

- An MPP-resident routine may not reference a variable 
defined in a host-resident routine. 

- Non-parallel arrays cannot be converted to parallel 
arrays or vice-versa. 

- Parallel arrays can only be passed by reference. 

- A maximum of one kilobytes can be passed at a time. 

- Arguments passed cannot be used as loop counters, 
(supposedly fixed) 

3) Several primitive functions are not currently implemented on 
the MPP array, e.g. ord , transpose. 

4) The number zero cannot be used as a parameter in the functions 
shift and rotate. 

5) No good timin g mechanism to time long operations. 

6) Parallel array elements cannot be individually indexed. 




19 


For example, since for the moment there is no high-level language I/O on the MPP, and 
copying of parallel arrays to the VAX is not currently implemented, a mechanism to con- 
veniently inspect parallel array data was necessary. Therefore, a function that accesses any 
element of an array was created. This function is called accessmpp. Its inputs are the matrix 
whose value is to be accessed and two integers which are the coordinates of the value to be 
accessed. The top left comer of the matrix has coordinates (1,1). The output from the access 
function is a single value of the same base type as the matrix. It is the value of the matrix at 
the coordinates requested. There is also a version of accessmpp for boolean matrices. The 
boolean access function is called accessmppb. The two library functions can be found in sec- 
tions 6.4 and 6.5 of Appendix A. 

Both these functions create a temporary mask. This mask is all zeros, except for the posi- 
tion denned by the input coordinates. The value selected by the mask is stored in a matrix 
whose other values are zero for accessmpp and false for accessmppb. By using the sum and 
or reduction functions, respectively for accessmpp and accessmppb, the temporary matrix is 
reduced to a single value Which constitutes the output of the access functions. To read several 
values out of a matrix the access functions have to be called several times. The approach of 
transferring a single value at the time from the MCU to the host, instead of transferring a 
whole array, was chosen to avoid problems do to the limited size of arrays that can be passed 
between the two machines. 

The access function is only a first step that allows pr ogr amm ers to verify small parts of 
their matrix results, given the current constraints imposed by the MPP system. More power- 
ful tools need to be developed to create a high-level I/O for the MPP. 

Another r estr i ction is that an MPP-resident routine may not reference a variable defined 
in a host-resident routine. The solution is to explicitly pass to the MPP-resident routine all 
variables in the host-routine that have to be referenced. 

For the problem with the shift and rotate functions not being able to receive directly 
the number zero as a parameter, the fix used was to create a local variable and assign to it the 
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integer zero. Then,- that variable was used when invoking the shif t or rotate functions. 

Possible solutions to the lack of a timing function able to measure long time intervals on 
the MPP are still being studied. The current timing function is limited by the size of the 
internal counter. A solution would be to reset the internal counter before it overflows and 
before exiting the improved printing function, calculate the total time. Another different 
approach, would involve the performance monitor of CAD. A command file would obtain the 
memory locations of the lines between which a timing measurements is to done, and would 
pass those values to the performance monitor. The performance monitor would then set 
breakpoints at those locations and call the timer. 

Once the programs developed were modified to take into account the restrictions and 
limitations of the MPP, it was possible to verify their correctness directly on the MPP and to 
obtain some performance measurements. The next two chapters present implementations and 
performance measurements for image permutation and rotation algorithms and an Mug model. 


CHAPTER 3 


PERMUTATION AND ROTATION ALGORITHMS 


The only permutation which is directly implemented by the MPP is the near neighbor 
rotate (or shift). The direction of the rotation may be in any of the four cardinal directions. 
In Parallel Pascal the main permutation functions are multi-element rotate and shift func- 
tions; other permutations are built on these primitives. 

The rotate function takes as arguments the array to be shifted and a displacement foe 
each of the array dimensions. For exam ple consider a one dimensional array a specified by 

ax array [0_n] of integer; 

The rotate statement 
a .*=• rotateC a, 5f, 
is equivalent to 

for i >* 0 to n do 

ofi] > a[(i + 5) mod (n + 1)1 

The rotation utilizes the toroidal end around edge connections of the mesh. The shift func- 
tion is similar except that the mesh is not toroidally connected and zeroes are shifted into ele- 
ments at the edge of the array; therefore, the shift function is not a permutation function in 
the strict sense. The concept of the rotate and shift functions extend to n dimensions; on the 
MPP the last two dimensions of the array c or respond to the parallel hardware dimensions and 
are executed in parallel, higher dimension operations are implemented in serial. The cost of 
the rotate function is dependent on the distance rotated. It also depends on the size of the data 
elements to be permuted. 
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3.1. Matrix Permutation 

The matrix permutation algori thm which is the basis upon which the rotation algorithm 
is constructed, is a general algorithm for implementing arbitrary permutations of a two 
dimensional matrix on mesh connected parallel processors. [73. It is also capable of performing 
any onto mapping. It uses a heuristic approach to reduce the execution time. 

The permutation of a matrix a is specified by two coordinate matrices c and r which 
have similar dimensions to a. The permuted matrix b also has the same dimensions as a. For a 
matrix element b[i,j] the corresponding elements r{i,j] and c[i,j] specify the row and column 
indices respectively of where the related element of a is located. That is, the permutation is 
specified by 

tfij] > 4 Hi jl Hij] 3 

More formally, the data arrays involved in the permutation are specified by: 

ajb : array [l_nrow,l_acol] of data; {where data is any base type} 
r : array [l-nrow,l-ncol] of 1-nrow; 
c : array [l-nrow,l-ncol] of 1-ncol; 

In order to compute the relative distance that the data must be moved, two pixel ele- 
ment identifying matrices idr and idc are precomputed. They contain the following: 

idrfij] > i 
idHij] > j 
for all ij. 

The relative distances to be moved are then specified by 

rr > (r-idr) mod nrovr, 
rc > (c-tdc) mod neat, 

In a permutation the data may be shifted in any of the four quadrants in order to reach 
a specified destination. However, in the following algorithms only positive data shifts are 
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considered, Le. in the up and left directions. The other three quadrants are covered by using 
modulo ari thm etic for shift distance calculations and implementing data movement with the 
rotation function which utilizes the end around mesh connections;. We have investigated a 
modified heuristic algorithm which checks all four quadrants and moves in the optimal direc- 
tion. Fewer data shift operations are required but the overhead due to checking alternative 
directions is significantly higher. 

3.1.1 . A Simple Permutation Algorithm 

A simple naive algori thm to achieve an arbitrary permutation is to slide a over all the 
possible positions of b, assigning the specified elements of a to each element of b when they are 
in the correct position. 

for 1 to nrov v do 
begin 

for /> 1 to ncoL do 
begin 

where (rr - i) and (re =* j) do 
b > <£, 

a :=* rotate(a, 0, l); 
end; 

a >• rotateCa, 1, 0); 
end; 

This algori thm involves OQi 2 ) operations for an n x n matrix. 

3.1. 2. The Heuristic Algorithm 

In many permutations which occur in practice there are well defined patterns for the 

For example, near neighbor shifts are trivial with complexity 0 (l), perfect shuffles can 
be implemented in O (ji ) time. The heuristic algorithm attempts to take advantage of the fact 
that rr and rc will be the same or similar for many elements. This is particularly true for 
operations such as matrix warping. 

The algorithm first slides (rotates) a as many locations up and left as possible such that 
future backtracking will not be necessary. If any element of a is correctly positioned over b 
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(ie. (rr - 0) and (rc - 0)), then b is updated. Otherwise, atr, which is a copy of the current 
version of a, is slid in the upwards direction until all outstanding elements of b, for which 
the current re « 0, are satisfied. The algorithm then shifts as far as possible up and left again 
and repeats the above procedure until all elements of the result mask, are false, Le. b is com- 
plete. 

The following variables are used in the algorithm: 

Variable declaration 

maskjnasktr : array [l-nrow, 1 jacoi] of boolean; 

atr : array [l_/irow, l«n col] of data; 

rrt : array [l_/trow, l-ncci] of Q-nravr, 

ri, rit, lastrit : Q-nrow, 

ci : CL ncob, 

Variable functions: 

mask : the result mask, true values indicate elements of b 

which have not yet received the correct element of a. 
ri, d : row and column distances for the up-left move. 
masktr : a version of mask to process one column. 
rit : a version of ri used to process one column. 
atr : a version of a used to process one column. 
rrt : a version of rr used to process one column. 

Lastrit : the last value of rit. 
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The Parallel Pascal version, of the heuristic algorithm is as follows: 

lastrit >0; 
b > a; 

mask > (rr O 0) or (rc O 0); 

while any (mask, 1 ,2) do 

{ iterate until the permutation is complete } 
begin 

ri minCrr, 1, 2); 
d > min (ret, 1, 2); 

{ move up and left as far as possible } 
a rotate(a, ri, d\ 
rr ^ rr - ri; 
ret- rc - d; 

masktr > (rr - 0) and (rc - 0); 

{ satisfy elements for the current position } 
if any (masktr, 1, 2) then 
atr r- a 

else 

{satisfy each element for the given col umn } 
begin 

where rc - 0 do 
rrt >■ rr 
otherwise 
rrt > nrovr, 
rit > min( rrt, 1, 2); 
masktr >> rrt ~ rit, 

{ the next seven statements implement } 

{ the statement atr =■ rotate (a, rtf, 0) } 

{ but also take advantage of the previous } 
{ shifts } 
if d O 0 then 
begin 

atr > a; 
lastrit > 0; 

end; 

atr > rotate( atr, rit - lastrit, 0 
lastrit > rit, 

end; 

{ update b for the current location of a } 
where masktr do 
begin 

b > atr, 
rr > nrovr, 
rc :=■ ncoli 
mask > false; 

end; 

end; 


This algorithm is bounded by n 2 iterations. However, this must be considered a loose 
bound since we currently do not know a permutation which would require all n 2 iterations. 
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The algorithm requires one iteration for a positive single element shift permutation but n-1 
operations for a negative shift since the rotate is in the wrong direction. 

3.1. 3. Algorithm Cost 

The cost of the naive algorithm is proportional to the number of rotate operations, Le. 
(n. — l) 2 * ( the cost of a one element rotate operation plus two comparison operations ). The 
heuristic algorithm has two major cost components: the rotate operations as noted before and 
the (min) reduction functions. The reduction functions are used to compute the multi- 
element distance for moves. In the tables for the performance of the algori thm, both the total 
number of element rotates and the total number of reduction operations are given. 

The relative cost of a rotation and reduction is both system and data size dependent. For 
the MPP, the cost of a reduction function is in the order of 4.2 fis whereas the single element 
rotation of 32-bit data requires in the order of 23 fis to 9.6 fis depending upon the number 
of successive rotate operations. Therefore, the reduction functions may represent a significant 
portion of the computation cost. With careful low level programming the reduction opera- 
tions can be overlapped with data rotate operations such that their effective cost is in the order 
of 1.4 fis . If the MPP was augmented with a small amount of additional hardware similar to 
that outlined in 10 then the reduction time could be reduced to 1-5 fis over half of which 
could could be overlapped with data rotation operations. The heuristic algorithm always 
requires less iterations and rotations than the naive algorithm; however, the additional over- 
head of the reduction function may make it less efficient in some instances. 

3.1. 4. Permutation Results 

The results of some permutations performed in order to obtain rotated matrices of size 32 
i 32 are given in table 3.1 and 23. These rotations are into mappings rather than permuta- 
tions (see section 3.3.1 for details). For comparison, the naive algori thm requires 961 iterations, 
961 rotate operations and zero reductions for any 32 z 32 matrix permutation or mapping. 
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Table 3.3 contains the results for perfect shuffle permutations for different size matrices: The 
result for perfect and inverse shuffles are identical for any matrix size. 

The perfect shuffle is an example of a perm utation which does not exhibit the locality 
property. The number of algorithm iterations needed to implement shuffles directly is 
in — l) 2 . However, the separability property of the two dimensional shuffle is not being used. 
If we use the permutation algorithm to the permutation in two stages, Le. first shuffle the 
rows and then shuffle the columns, then n — 1 iterations are needed for each permutation. 
Therefore, the perfect shuffle when implemented directly has complexity Oin 2 ) , but when 
computed in two stages the algorithm is much more effective and has Oin ) complexity. The 
results of implementing the perfect shuffle as two separable shuffles are also given in. Table 3.3. 
Table 3.4 shows the results of a random permutation; this demonstrates that the heuristic is 
not effective when the mapping does not possess the locality property. 


2J2. Large Arrays 

Frequently the data to be processed by a parallel processor will be in the format of 
arrays which exceed the fixed range of parallelism of the hardware. Therefore, it is necessary 
to have special algorithms that will deal with large arrays by breaking them down into blocks 
manageable by the hardware, without loosing track of the relationships between different 


Table 3.1: Cost for a near neighbor rotation on a 32 x 32 matrix 

centered at 16 16. 


Angle of rotation 

Matrix rotation mapping cost 

iterations 

rotations 

reductions 

0 

0 

0 

0 

15 

124 

562 

340 

30 

262 

620 

748 

45 

505 

S37 

1464 

60 

741 

1022 

2163 

75 

724 

1019 

2119 

90 

528 

1007 

1552 
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Table 12: Cost for a near neighbor rotation on a 32 x 32 matrix 

centered at 1 1. 


Angle of rotation 

Matrix rotation mapping cost 

iterations 

rotations 

reductions 

0 

0 

0 

0 

15 

235 

304 

666 

30 

437 

517 

1266 

45 

625 

683 

1825 

60 

779 

829 

2292 

75 

889 

956 

2631 

90 

993 

992 

2946 


Table 3.3: Cost for perfect shuffle permutations for 
different matrix sizes. 


matrix size 

Direct Shuffle Cost 

Separable Shuffle Cost 

iterations 

rotations 

reductions 

iterations 

rotations 

reductions 

4x4 

9 

12 

22 

6 

6 

6 

8x8 

49 

56 

134 

14 

14 

14 

16x16 

225 

240 

646 

30 

30 

30 

32 x 32 

961 

992 

2822 

62 

62 

62 


Table 3*4; Permutation cost for a random permutation. 


matrix size 

Random Permutation 

iterations 

rotations 

reductions 

32 x 32 

630 

995 

1851 


blocks. 

One scheme, which is frequently used on the MPP, is to partition the large array into 
blocks which are conveniently stored in a four dimensional array. The range of the first 
dimension of this array specifies the number of blocks in each row of the large matrix and the 
range of the second dimension specifies the number of blocks in each column. Given a concep- 
tual large matrix 


mx : array [0-x.O-y] of btype; 
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which , is to be stored in an array a of type 

array [l-n, 1-m, 1-p, l-q] of btype; 

Element i,j of the large matrix is mapped into the array a as specified by 

ma&j] - a [l+i div p, 1 +j div q, 1+t mod p, l+j mod q] 

For example, a 512 x 256 matrix could be stored in eight blocks as 

la : array [1-4, 1-2,1-128,1-128] of real; 

This data structure allows blocks to be manipulated independently. However, it still 
preserves the positional relationships of those blocks in the original large matrix. 

To simplify the manipulation of large arrays on the MPP, two Parallel Pascal library 
functions IraCate and Ishift have been developed. These functions take an array argument 
and two displacement arguments, like the primitive matrix rotate and shift functions, how- 
ever, in this case the array argument is a four dimensional array which is treated like a con- 
ceptually large matrix. 

Many programs can be converted to operate on blocked rather than conventional 
matrices by simply replacing all instances of rotate and shift with lrotate and Ishift respec- 
tively. This is true for the permutation programs presented; however, in the case of the 
heuristic permutation algori thm, this is not a very efficient solution. A better method is to 
scan through the result blocks and perform permutations on only the input blocks that contri- 
bute to the current result block being processed. This algorithm is shown below. 
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Tar 

laJJbr. array [l_n t l«m,l-arow,l_ncol] of data; 
Irja, array [l_n,l-m,l-nrow,l.mcol] of index; 


begin 

for i - 1 to n do 

for ; - 1 to m do 

begin {process each result block.} 
rb > 1 + Zr[v3 div nrovr, 
cb > 1 + lc(i,j] dir nook, 
ro 1 + b{i,j ] mod nrovr, 
oo >■ 1 + U{i,j] mod ncok, 
for k - 1 to n do 
for l - 1 to m do 

begin {consider each input block} 
maskb > (rb — k) and (cb - (y, 
if myimaskb, 1, 2) then 
where maskb do 

> perm2(Zo{£ I Zl 
rofiO/naskbX 

end; 

end; 


end; 


Perm 2 is the heuristic algorithm presented previously with the modification that the ini- 
tial mask value is passed as an argument. That is, only elements selected by the mask are per- 
muted. An additional speedup is achieved by this since the heuristic works much better 
when only a subset of elements are to be permuted. 


3.3. Matrix Rotation 

The permutation function described above serves as the basis for a general rotation algo- 
rithm. Three rotation techniques are considered; nearest neighbor, bilinear interpolation and 
bicubic interpolation. A rotation is specified by three parameters; the location of the origin of 
rotation (r<*c o) and the rotation angle 0 . The starting point of all rotation algorithms is the 
generation of the mapping matrices r and c from these parameters. 

3.3.1 . Nearest Neighbor 

The nearest neighbor algori thm is simply an into mapping in which a result element is 
assigned the value of the nearest rotated matrix element. In this case the new row and 
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column coordinate matrices, r and e, are defined as follows 

r[i,j\ - roundC (c 0 “ j)sin(0) + ( i — r<Jcos(.Q) + r 0 ) 

c[i,j\ - roundC (/ — Co)cos(9) + (i — + c 0 ) 

for all i and f, any values of the result which have near neighbors outside the range of the 
input matrix are set to zero. 

In performing a rotation, some elements are rotated off the result matrix and some ele- 
ments are selected which are outside of the input matrix. In our algorithm result elements 
for the latter case are simply set to zero. Therefore we have a permutation in which a subset 
of the input elements map into a subset of the result elements; the size of these subsets 

depends upon the angle and origin of the rotation. The rotation is achieved by using the valid 

elements of r and c with the heuristic permutation algorithm. 

3J L2. Bilinear Interpolation 

For the bilinear interpolation algorithm a result element is computed from a weighted 
sum of the four rotated input matrix elements which surround it. There are two possible 
approaches to implementing this scheme. First, we can compute four permutations; each per- 
mutation acquiring one of the four neighbors for each element. This is called multiple permu- 
tations. This was the approach used until now. 

The second method is to perform one permutation and then seek, the local neighborhood 
of the rotated input matrix for the other near neighbors. The idea being that a local search 
will require less computation than four complete rotations especially when the angle of rota- 
tion is large. The local neighborhood of a single rotated matrix does not contain a complete set 
of the near neighbor elements of the input matrix; some are lost due to grid spacing 
differences. A complete set can be guaranteed, however, if we also include the local neighbor- 
hood of a slightly perturbed, rotated input matrix. This scheme is called the double permuta- 
tion method; both rotated matrices can be computed simultaneously with a single execution of 
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a slightly modified heuristic permutation algorithm. If we map a result element back, to the 
input matrix, it will be surrounded by four elements PI - P4 as shown in figure 12. These 
points are moved to the processing element associated with the result P and the interpolation 
is then computed in parallel. 

For the interpolation algorithms, the matrices, rp and cp, contain the actual locations of 
the rotated elements. 


rp[i,j 3 = (c 0 — jXsinQ ) + (i — r^Xcos9) + r 0 
cp[i t j\ = C j — eoXcorfl) + (i — roXsind ) + c 0 


for all i and j. 

The coordinates of the near neighbors are as follows 
r top ~ |(c 0 — j Xsin 9 ) + Ci — TqXcos 9 ) + r 0 

Cleft ~ jo “ CoXcorfl) + (t — roXrinfl) + c 0 

f bottom ^top "b 1 

c right ^ie/r + 1 


cf 

PI s K s P2 

i I 

rf I I 

I * P 

P3 * * P4 


Figure 3.1. Bilinear Interpolation 
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The interpolation fractions are: 


rf =rp - r w? 


c f = cp - Cuft 

Once the points PI - P4 have been obtained the interpolated result is computed as fol- 


lows: 


P = (1 - cf >(1 - r/>Pl + (l-r/ >c/*P 2 + (l - cf frf*P 3 + cf*rf*P4 
The algorithm for the multiple permutation approach is as follows: 


begin 


c >■ Cu.fi 

b >* coefl » permCo, r, cY, 

{ value of top left neighbor } 
oc + 1 ; 

b r- coef2 * permGx, r /:) + b; 

{ value of top right neighbor } 
r >■ r + 1 ; 

b >■ coef 3 * permCa, r,c) + b; 

{ value of bottom right neighbor } 
oc - 1 ; 

b >• coef 4 * permCo, r,c) + b; 

{ value of bottom left neighbor } 

end; 

Perm is the heuristic permutation function- The final result of rotating a is stored in the 
matrix b. 

The double permutation approach uses a modified permutation function which creates 
the following matrices: 


b [i-j] > d HXjl cfi,jB 

and 

d [i,jj := a{ r[i»jl cfi,j] + 1 ] 


where 
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a is the original matrix 
b is the rotated matrix 
d is the shifted rotated matrix 
r is the row coordinate matrix 

and 

c is the column coordinate matrix 

To avoid loosing the value of the center of rotation, when the origin is located to the right of 
the matrix center, the matrix is shifted left and, therefore, in the equation defining matrix d 
we substitute a negative one for the constant one. 

The second step in the double permutation algorithm is a local search performed on both 
rotated and shifted rotated matrices in order to find all the values of the elements needed for 
the interpolation. The local search has a constant maximum cost for any size matrix. It there- 
fore has an advantage over the multiple permutations approach, since every permutation in 
that approach will become more costly as the matrix size increases. 

For the worst case rotation angle (8 - 45), it has been determined that a local search in a 
5x5 window is sufficient to yield the values of all the elements needed to perform a bilinear 
interpolation. The local search strategy implemented in our algorithm is a spiral search. The 
elements are selected by comparing their row and column coordinates to those needed. Once 
they match, their values can be obtained from the rotated matrix or from the rotated shifted 
matrix. 

3.3. 3. Cubic Interpolation 

The cubic interpolation version of the rotation algorithm is a simple extension of the bil- 
inear interpolation scheme. The first step, finding the coordinate matrices r and c, is identical 
to the b iline ar interpolation case. After obtaining these matrices, the values of sixteen neighbor 
points must be acquired. If the multiple permutations approach is used, then sixteen separate 
permutations will be required. However, with the double permutation scheme only a small 
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extension of the bilinear algorithm is needed. TnstMd of using a 5x5 window, which is the 
case when four points have to be found, a 7 x 7 window is necessary to find sixteen points. 
However, since the element will have rotated in a specific direction, the search window can be 
reduced to a 7 x 5 window. Each row of points needed will use a different 7x5 window of 
search. 

Once all the values needed are found, the bicubic interpolation itself is done by, first, 
performing a cubic interpolation for each of the four rows and, then, perfo rmin g a fifth cubic 
interpolation on the row points obtained. 

As shown in figure 3.3, the reference point for the cubic interpolation computed in step 
one is P 6 . The first four cubic interpolations are performed to obtain points pa, pb, pc and pd. 
The fifth one yields the value of point P. 

3.3. 4. Test Results 

The local search performed in the double permutation algorithm has a constant max- 
imum cost for any size matrix. For any matrix the maximum cost is 100 rotations and 25 
reductions for the bilinear interpolation method and 552 rotations and 525 comparisons for 
the cubic interpolation. 


PI * 

P2 

* 

+pa P3 * 

P4 

P5 * 

P 6 

* 

| 

+pb P7 * 

J 

P8 



1 

1 

* P 


P9 * 

P10 

t 

+CC Pll * 

P12 

P13 * 

P14 

* 

+pd P15 * 

P16 


Figure 3.2. Bicubic Interpolation 
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The results for rotations using bilinear interpolation for a 32 x 32 matrix are given in 
table 3-5 and 3.6 for different centers of rotation- The results of rotations applying cubic 
interpolation are given in tables 3.7 and 1 &. 


3.4* The MPP Implementation of the Rotation Algorithm 


Some simple modifications of the rotation p ro g r ams were necessary for them to run on 
the MPP. However, after these changes were made it was possible to obtain values for the 
number of rotations and reduction functions required during the permutation and, when it 


Table 3-5: Cost of bilinear interpolated rotation centered at 

coordinates 16 16. 


angle of rotation 

Double perm- 

Multiple perm. 

rotations 

reductions 

rotations 

reductions 

0 

1188 

184 

4092 

649 

15 

850 

374 

2488 

1416 

30 

1011 

766 

2544 

2978 

45 

1448 

1503 

3292 

5916 

60 

1858 

2174 

4084 

8602 

75 

1846 

2151 

4081 

8491 

90 

1793 

2001 

4090 

7920 


Table 3.6: Cost of bilinear interpolated rotation centered at 

coordinates 1 1 


angle of rotation 

Double perm. 

Multiple perm. 

rotations 

reductions 

rotations 

reductions 

A 

V 

13 

3 

4 

6 

15 

657 

703 

1342 

2705 

30 

1050 

1302 

2035 

5093 

45 

1389 

1834 

2683 

7238 

60 

1702 

2329 

3291 

9230 

75 

1931 

2669 

3471 

10627 

90 

2086 

2971 

3968 

11780 
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Table 3.7: Cost of cubic interpolated rotation centered at 

coordinates 16 16 


angle of rotation 

Double perm 

Multiple perm 

rotations 

reductions 

rotations 

reductions 

0 

1640 

684 

16368 

2608 

15 

1302 

874 

10072 

5757 

30 

1463 

1266 

10394 

12021 

45 

1900 

2003 

13300 

23668 

60 

2310 

2674 

16341 

34371 

75 

2298 

2651 

16315 

33928 

90 

2245 

2501 

16360 

31664 


Table 3.8: Cost of cubic interpolated rotation centered at 

coordinates 1 1 


angle of rotation 

Double perm 

Multiple perm 

rotations 

reductions 

rotations 

reductions 

0 

552 

525 

272 

30 

15 

1109 

1203 

7683 

11132 

30 

1502 

1802 

9723 

20589 

45 

1841 

2334 

11815 

29104 

60 

2154 

2829 

13793 

37047 

75 

2383 

3169 

15316 

42586 

90 

2538 

3471 

15872 

47088 


exists, the interpolation phases. Tables 3.9 through 3.12 contain those values for the near 
neighbor and the bilinear interpolation rotations. These values show that the heuristic permu- 
tation and rotation algorithms are efficient when compared to the naive algorithm. From these 
values, expected times, included in tables 3.13 to 3.16, were calculated. It was also possible to 
obtain timin gs for the total r unnin g time of the progr am. The total time includes the host 
r unning time, the MPP ru nnin g time and also the time necessary to set up all calls to the 
MPP. Several runs were effected to obtain the average values in tables 3.13 through 3.16. 
The values for the total time obtained in the several runs varied approximately 1-5 sec in each 
direction. Such a large variation is to be expected in the total time since it involves the host 
r unnin g time which is highly dependent on the system load. The host ru nnin g time involves 
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only the input of the parameters necessary for the rotation like the center and the angle of 
rotation. 

The running time on the MPP is calculated using timing functions created by Jim 
Abeles at NASA Goddard Space Center. These timing functions are: 
pfmjpit 

p fm_stani namednteger); 
pfm _sto p( namednteger) 
pfm _close 

They use the CAD perform a nc e monitor to calculate the time taken by a section of code. 
The first function to be called is pfmjpit. It clears a counter register and attaches the perfor- 
mance monitor to the MCU. The section of code to be timed has to be placed between a 
pfm _start call and a pfm _£top call, both using the same integer to identify the section. 
When a call to pfm jtart is found, the counter starts being incremented. Once the 
corresponding call to pfm _stop is reached, a call to the VAX is marie, the VAX reads the 
value from the counter register in the MCU and the difference between the starting and final 
values of the counter is computed. The values accumulated for the different sections of code 
timed and their total are printed when pfm jdose is called. At that time the performance 
monitor is also detached from the MCU. 

Tables 3.13 and 3.14 consist of the timings for the near neighbor rotation, tables 3.15 and 
3.16 for the bilinear interpolation rotation. The MPP times include a variable overhead of 
approximately 0.1 sec for each call made to an MCU-resident procedure. This overhead was 
determined by timin g a call to a dummy MCU-resident procedure. This dummy procedure 
contained no statements in it. This overhead occurs because when an MCU-resident routine is 
called the UNIBUS link between the host machine and the MCU is used. Therefore, the 
tr ansf er of about 1000 bytes involved in calling a routine takes approximately 1 msec. To call 
a routine and then return would take twice that time, taken into account only the transfer of 
infor mation. However, to this has to be added also the time necessary to interface with the 
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VAX operating system- For the near neighbor and bilinear interpolation rotations, four calls to 
MCU-resident routines were made in each case. These calls explain the difference of approxi- 
mately 0-4 sec between the MPP times and the estimated times, which otherwise were very 
similar. 

The estimated times were calculated by considering the cost in microseconds of several 
operations. These costs are based on an instruction cycle time for memory access and operation 
on the MPP of 100 ns. For example, a shift or rotate operation costs 2n where n is 0.1 fi sec 
for Boolean, 0.8 fi sec for integer and 3-2 /i sec for real variables: A cost of 2n was assigned to 
elementary Boolean operations. Arithmetic operations were assigned a cost of 3n when they 
involved two operands and an assignment. 

A comparison between the ratios of the number of shifts to the n 2 shifts for a 32 by 32 
matrix and a 128 by 128 matrix, where n is the size of the matrix, demonstrates that even if 
the size of the matrix increases, the ratios remain very similar for the sam e rotation angles 
which is a good characteristic of this algorithm. Figures 3.4 and 3-5 show these ratios. 


Table 3.9: Cost for a near neighbor rotation on a 
128 x 128 matrix centered at 1 1. 


Angle of rotation 

Matrix rotation mapping cost 

rotations 

reductions 

0 

0 

0 

15 

5114 

7506 

30 

8495 

14046 

45 

11103 

20050 

60 

13504 

25121 

75 

15516 

28802 

90 

16256 

32385 
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Table 3.10: Cost for a near neighbor rotation on a 
128 z 128 matrix centered at 64 64 


Angle of rotation 

Matrix rotation mapping cost 

rotations 

reductions 

0 

0 

0 

15 

9183 

3594 

30 

10446 

7800 

45 

13589 

15777 

60 

16378 

23359 

75 

16375 

24331 

90 

16319 

16384 


Table 3.11: Cost for a bilinear interpolation rotation on a 
128 x 128 matrix centered at 1 1. 


Angle of rotation 

Matrix rotation mapping cost 

rotations 

reductions 

0 

101 

25 

15 

9009 

7555 

30 

15689 

14107 

45 

21246 

20034 

60 

26228 

21192 

75 

30023 

28893 

90 

32614 

32410 


Table 3.12: Cost for a bilinear interpolation rotation on a 
128 x 128 matrix centered at 64 64. 


Angle of rotation 

Matrix rotation mapping cost 

rotations 

reductions 

0 

101 

25 

15 

11510 

3628 

30 

14648 

7826 

45 

21556 

15800 

60 

28282 

23378 

75 

28763 

24382 

90 

24676 

16409 
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Table 3.13: Timing of near neighbor rotation on a 128x128 matrix 

centered at 1 1. 


Angle of rotation 

Matrix rotation timing (sec.) 

total CPU time 

MPP time 

Estimated time 

0 

5-49 

0-54 

0.01 

15 

5-56 

0.61 

0.04 

30 

541 

047 

0.07 

45 

5-50 

0-58 

0.10 

60 

5.46 

0-55 

0.12 

75 

5.15 

0.58 

0.14 

90 

542 

048 

0.16 


Table 3.14: Timing of near neighbor rotation on a 128x128 matrix 

centered at 54 64. 


Angle of rotation 

Matrix rotation timing (sec.) 

total CPU time 

MPP time 

Estimated time 

0 

5.43 

042 

0.01 

15 

5.36 

0.42 

0.03 

30 

5.21 

047 

0.05 

45 

5-50 

0-52 

0.08 

60 

5.35 

0-54 

0.12 

75 

5.37 

0-56 

0.13 

90 

5.45 

0.47 

0.09 


Table 3.15: Timing of bilinear interpolated rotation on a 
128x128 matrix centered at 1 1. 


Angle of rotation 

Matrix rotation timing (sec) 

total CPU time 

MPP time. 

Estimated time 

0 

5-54 

0-57 

0.01 

15 

5.64 

0.65 

0.05 

30 

5.78 

0.82 

0.08 

45 

5.45 

0.48 

0.10 

60 

5.51 

0-54 

0.13 

75 

5.61 

0.62 

0.16 

90 

5-54 

0-57 

0.19 
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Table 3.16: Timing of bilinear interpolated rotation on a 
128x128 matrix centered at 64 64. 


Angle of rotation 

Matrix rotation timing (sec.) 

total CPU time 

MPP time 

Estimated time 

0 

5J0 

0-52 

0.01 

15 

5.39 

0.44 

0.03 

30 

5.41 

0.46 

0.05 

45 

5-53 

0-53 

0.10 

60 

5.44 

0-59 

0.14 

75 

5A6 

0-59 

0.15 

90 

5.41 

0.48 

0.10 
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Angle in Degrees 


Figure 3.3. Ratio of number of rotations to n 2 rotations 
for near neighbor rotation with center of rotation 
at 1 1 and at matrix center. 





Figure 3A Ratio of number of rotations to n 2 rotations 
for b ilinear interpolation rotation with, center of 
rotation at 1 1 and at matrix center. 



CHAPTER 4 


THE ISING MODEL 


Another application program developed for the MPP is an Ising model simulation- The 
Ising model is a simplified model of a magnet in which the atomic magnetic moments or spins 
are arranged on a lattice and can point only up or down- To completely specify the micros- 
copic state of the system, it is enough to assign a value of +1 (up) or -1 (down) to spins 
assigned to each point on the lattice. The Ising model is a suitable model for implementation 
on the MPP since interactions between spins can only occur with the nearest neighbors on the 
lattice. Such a structure maps well onto the physical co nfig uration of the MPP. 

4.1. The MPP Implementation 

The purpose of this Ising model simulation is to calculate the energy as the sum over a 
lattice of the product of neighboring spins divided by the number of interactions between 
those spins. This energy is related to the strength of a spin-spin interaction k as well as to the 
strength of the interaction with an imposed magnetic field h. 

The lattice chosen for this simulation is a three-dimensional array of spins of dimensions 
128 by 128 by 128. 

Since the Ising model has only two possible values for the spins: -1 and +1, the energy 
could be easily calculated by summing over the four possible states of spin interaction. How- 
ever, the number of states is an exponential function of the number of spins. For example, 
even a small 10 by 10 lattice already has 2 100 configurations to sum ovpr. The lattice we con- 
sider in this simulation will have 2 2097152 possible configurations. 

It is obvious that a direct approach to this problem is not possible. Another solution 
would be to randomly select states and, based on them, estimate the sum. Unfortunately, the 


45 



46 


majority of states in most systems with a large number of particles do not make a significant 
contribution, being energetically unfavorable, [ll]. Therefore, random sampling is an 
inefficient and impractical solution. 

The approach taken in most cases and also in this simulation is to generate states with 
the probability they would have in nature. The method is called a Monte Carlo simula tion 
because random numbers are an important feature of this approach. A probabilistic model of 
the system is used to dete rmine numerical solutions. In other words, instead of analyzing the 
system as a whole, we can analyze the behavior of a large number of individual particles. 
Those particles exhibit a random behavior based on the model chosen. By considering each 
particle as a separate experiment and taking the statistical average of many experiments, we 
can obtain the necessary information about the system. [12]. 

The algorithm chosen can be summarized as follows: 

- generate a sequence of configurations by modifying each spin according to 
the energy change, its neighbors influence and a comparison with a random 
number. 

- for each state, which is formed by a sequence of configurations and depends 
on the value of the spin-spin interaction, calculate the average total energy. 

In more details for our simula tion to generate different configurations we start at an 
arbi tr a r y state. That arbitrary state is first created by initializing a set of 40-bit shift registers 
used to generate random numbers and a set of bits, in our case four bits, that will be used as a 
feedback at the next step to generate the new set of random numbers. 

The bits of the shift register set are initializ ed and modified depending on the results of a 
random number generator. The bits of the integer returned are assigned to the shift registers. 
There are no limi ts set on the size of the random integer returned by the random number 
generator. 

At high temperatures the spin-spin interaction k is small and the spins are almost 
independent. On the other hand, at low temperatures the spin-spin interaction k is large and 
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the probability that the spins will be found in the same state increases. Therefore, after the 
first arbitrary state has been defined, for every new value of £ a new table of probabilities for 
a positive spin has to be created. That table gives the probability of a positive spin for each 
possible sum of neighbors according to the energy associated with the spin-spin interaction and 
the imposed magnetic field interaction h. 

In this simulation, we calculate the effect of the six nearest neighbor spins on each spin 
on every point on the lattice. The six nearest neighbors are those located to the up, down, east, 
west, south and north directions. The possible spin sums for the neighbors range from -6 to 
+6. The probabilities for every sum are an exponential function of the spin-spin interaction 
and the magnetic field interaction. They are defined as follows: 

-Jfci+A 

prod (i )= 

r e Ja + k +e~ a ’ rA 

where i is one of the possible sums. 

After the probability table is initialized, the spins in the three-dimensional lattice can be 
recalculated. Our simulation allows us to define the number of configurations or times the 
spin lattice will be recalculated. That determines how many configurations will form a state 
and so, how many configurations will be involved in determining the energy at each state. 

To recalculate the spins in the lattice we proceed plane by plane. There are 128 planes 
and each one of them is a 128 by 128 matrix. For each plane the sum of spins of the six 
nearest neighbors is calculated at each site. Since this simulation was implemented on the 
MPP, a parallel processor with dimens ions 128 by 128, all those sums can be calculated in 
parallel. 

After the sum of the neighboring spins is calculated, a new state is calculated by using 
the set of bits and shift registers defined at the beginning of the simulation. It is here that the 
set of bits is used, by exclusive-oring them with the old register array, to create the new regis- 


ter array. 
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Titan, the spins are reset according to the probability table. However, all the spins are 
not reset at once. A boolean mask is initial}?!-*! to a checkerboard pattern and applied to each 
one of the planes of the lattice. Where that mask is true the spins are reset according to the 
spin sum of their neighbors. Once all the possible spin sums have been considered, the mask is 
inverted. 

At that point the same sequence of events, from the computing of all the neighbor sums 
to the resetting of the spins over the masked lattice, is repeated. At that point a new next 
state has been generated. For each one of the states created the average total energy for each 
spin-spin interaction is calculated by taking the mean of the sums over the configurations gen- 
erated. 

Two different implementations of the Ising model were timed on the MPP. Their 
difference was simply on the way the calls were made to the MCU-resident routines. In one 
version, all the calls to the MCU routines were grouped under one general procedure which 
was called by the VAX. In the other version, each one of the MCU routines were called 
directly from the VAX. The only real difference between these two approaches when timing 
is concerned is that the second version takes approximately 0.1 sec for each MCU procedure 
calL In this program, this is not a very significant difference. 

Different parts of the Ising model were timed. Table 4.1 lists average times in 
microseconds for several sections of the Ising program. Section 1 is the random number gen- 
erator that returns a positive random integer. This random number generator is called 16384 
times to completely initialize the random matrix. Section 2 corresponds to the initialization of 
the table of probabilities which gives the probability of a positive spin for each possible sum of 
neighbors. This table is reinitialized every time the strength of the spin-spin interaction 
changes: Section 3 calculates for a given bit plane the sum of spins of neighbors at each site. 
It is computed the number of iterations selected and, for each iteration, it is computed 128 
times so that all bit planes are covered. Section 4 resets the spins according to the probabilities 
calculated in section 2. It is computed the same number of times as section 3. 
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Table 4.1. Timings of sections of the Ising p ro g ram 


Sections of Ising 

Measured Times ( n sec ) Expected Times( n sec) 

Section 1 
Section 2 
Section 3 
Section 4 

16.2 27 

4431.8 4500 

185.2 168 

32290.2 32330 


The expected values were very close to the obtained values when the operations 


involved mainly array manipulations. 








CHAPTER 5 


CONCLUSION 


This thesis presented both tools and algorithms adapted to the Massively Parallel proces- 
sor. The system and programming tools were created to be used in conjunction with the 
Paralel Pascal development and MPP-compiler systems. 

The tools include a command file that combines library preprocessing, translation from 
Parallel Pascal to standard Pascal and compilation in one step, a command file that simplifies 
the compilation, assembly and linking of Parallel Pascal programs and access functions to 
overcome the current lack, of high-level I/O functions to transfer parallel arrays from the 
MPP to the host machine. 

All the features mentioned above create a user friendly environment for p ro gra m 
development and debugging. However, it is still necessary to add new functions to increase 
the usefulness of the system. Examples of such functions are a more complete MPP I/O sys- 
tem and either a preprocessor or a modification of the Parallel Pascal compiler that would 
indicate when functions used by a program are not yet implemented. 

Using the development system and tools mentioned above some algori thms were coded 
and tested directly on the MPP hardware after being developed on the serial host. An 
effective heuristic algorithm was developed for arbitrary permutations and data mappings for 
the MPP was presented. This algori thm is very effective when large arrays are to be pro- 
cessed, when few elements are to be moved or when many elements share a .similar motion 
like it is the case with small angle rotations and well behaved permutations. An effective 
technique for matrix rotation interpolation, involving a local search scheme and based on the 
above permutation algorithm was also studied. 
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It was possible to obtain results and timing measurements from the MPP for the algo- 
rithms mentioned above. The timing results aggree well with the estimated times. The 
estimated times were computed using average times for the matrix assignements, rotations and 
reduction functions, and the number of rotations and reductions performed during the per- 
mutation and interpolation phases of the rotation algorithms: 

Another algorithm implemented on the MPP was the Tsing model. This model is ideal 
for implementation on the MPP since interactions between atomic magnetic moments or spins 
only occur with the nearest neighbors on the lattice and most operations are on Boolean data. 
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6.1. \VMS cnmmanri fil e used to implement on command 

L- — 


S! 

$ 1 compile a parallel pr o gr am using ppt 

S! 

$ echo >■ write sysSoutput 
$ extern ?• SsysSusrlfreeves.ppascal]extem 
S ppt >• SsysSusrlireeve&ppascaljppt 
$! 

$ on e rror then goto err 
$1 

$ int - 0 
$ 1st .*» 

$ ppf > 

$ lbs 

$ Jforeach i (Sargv) 

$ count - 1 
forloop: 

$ Jswitch (Si) 

S arg - pteount* 

S if arg .eqs. "" then goto endfor 
$ lease -i: 

$ if arg -aes. "-I* then goto c2 
$ int =» 1 

$ goto endsw 

S tease -s 
c2: 

$ if arg mes. "-S" then goto default 
$ 1st «* ’arg* 

S goto endsw 

default: 

$ 

$ 

$ 

$ 

$ 

$ 

$1 
pp: 

S 

endsw: 

$ count - count + 1 
S goto forloop 
endfon 
S! 

S name - " "fSparseCppf*/ NAME* T 
% ext - ’"fSparseCppf^’TYPE 
$ dir - ' ”fSparse(ppf DIRECTORY* 7* 

$ ! 

S if ext mes. " .PP" then exit !no .pp hie so do nothing 


name - " "fSparseCar^* NAME" 7" 
ext - ""fSparseCarg^’TYPET 
dir - " "fSparse(arg^" DIRECTORY" 7" 
if ext mes. " .PP" then goto pp 
ppf > ’arg’ 
goto endsw 

else 

lbs > 'lbs' " " ’arg' 
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Sc- S * * * * * 11 "dir” game'" 

S b - " VJ" 

$ d - "V-obf 
$ 1 - " V.li s" 

$ echo "*** Pascal Library Processor and ***" 

$ echo "*** Parallel Pascal Translator ***" 

S {extern $lst Slbs <5ppf I ppt >Sb 
S ass/user ’ppf* sysSinput 
S ass/user pptemp.tmp sysSoutput 
$ extern Tst* lbs' 

S ass/user pptemp.tmp sysSinput 
S ass/user V sysSoutput 
S ass/user sysSerror terminal 
$ ass/user T pplist 
Sppt 

S deass sysSinput 
S deass sysSoutput 
S del pptemp.tmp.* 

S !tail -1 pplist I grep -s 'No erro r s * 

S! 

S fif (Sstatus — 0) then 
S open/read pplist pplistdat 
S newline 
loopl: 

S oldline - newline 

S read/end_of Jile - exitloop pplist newline 

S goto loopl 

exitloop: 

S close pplist 
S if oldline mes. 

"Syntax Analysis Complete, No errors detected." then goto err 

S if int me. 1 then goto comp 
S echo "*** Pascal Translator ***" 

S ! pi -w Sb 

S echo "No interpreter on VMS." 

S exit 

S telse 
comp: 

S echo "*** Pascal Compiler and Linker ***" 

S pas V 

S link’d’ 

S exit 

err: 

S echo " *** No Compilation ***" 

S del pptemp.tmp.* 

S exit 

62. UNIX shell file used to implement dv command 
# 

# compile a parallel program using ppt 

# 

set int = 0 
set 1st - 
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set ppf - "" 
set lbs - "" 
foreach i (Sargv) 
switch (SO 
case -i: 

set int - 1 
breaksw 

case -s 

set 1st - Si 
breaksw 

default: 

ifC Sue pp) then 
set ppf - Si 
else 

set lbs - (Slbs SO 
breaksw 

endsw 

end 

if( Sppfa f pp) exit #no .pp file so do nothing 
set b - Sppfar.p 
set c - Sppfrr 

echo "*** Pascal Library Processor and ***" 

echo "*** Parallel Pascal Translator ***" 

extern Slst Slbs . <5ppf I ppt >Sb 

tail -1 pplist I grep -s "No errors" 

if (Sstatus — 0) then 

if (Sint =— 1 ) then 

echo "*** Pascal Translator ***" 

pi-w Sb 

else 

echo "*** Pascal Compiler ***" 

pc Sb -w -o Sc 

endif 

else 

echo ’*** No Compilation ***" 

ftr trlif 

6.3. VMS command file used to implement rmvv command 

SI BUILD THE NECESSARY FILES TO EXECUTE A PROGRAM 
SI ON THE MPP. 

SI 

SI Get the name of the file 
S 

S if pleqs."" then inquire "file : " pi 
S pa - "”pl’" + ".pp* 

S tempa - "t" +• ""pi*" + ".pp" 

S assign/user 'pa' sysSinput 
S assign/user 'tempa' sysSoutput 
S run [reevesjnpplmppextem 
S deassign sysSinput 
S deassign sysSoutout 
Stemp - V + ’’’pi’" 

S! 
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$1 Execute the compiler/code-geaerator 

si 

$ PPDEV ’temp’ 

SI 

SI Delete unnecessary files. 

S! 

$ DELETE 'temp'.pctbVtemp’xgh* 

SI 

SI Assemble the vax program. 

SI 

S MACRO 'temp' 

SI 

SI Assemble the mpp program. 

SI 

S MCL ’tempVlis/obj 
SI 

SI Link the mpp programs. 

SI 

S MPPLINK/deb [de vaney.primre lease] prlfix-psl/lib/map - 
SI 

SI Link the vax routines with the mpp symbol tables. 

SI 

SCADLNK ’tempV tempest bJ’PDEVRUN/lib 
SI 

SI The following message informs you how 

SI to run the program. It is not executed 
SI from this procedure! 

SI 

$ CLR 

S WRITE SYSSOUTPUT "To run the pr o gra m type : CAD "temp* ”texnp’ "temp’* 
SEXIT 

6.4. Printmvv Function 
#printmpp 

{ function printmpp (x,y : integer, matrix : plr):btype; extern; } 

{ $1 $2 $3 SO } 

var 

mask : parallel array [1-128,1-128] of boolean; 
temp : S3; 
result : SO; 
zero : integer; 

begin (* of printmpp *) 

zero > 0; 
ma.sk > true; 

mask :=• not(shift(mask,-l,zero) or shif tCmaskxero,- 1 )); 
mask > shift (mask,-x+l,-y+l); 
temp >* 0; 
where mask do 
temp :=■ matrix; 
result :=■ sum (temp,l,2); 
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printmpp >• result; 
end; (* of printmpp *) 

6-5. Printmvvb Function 
#printmppb 

{ function printmpp (x,y : integer; matrix : plb):booleaa; extern; } 

{ $1 S2 S3 SO } 

{ where plb is a parallel two-dimensional array of type boolean } 


var 

mask : parallel array [1-128,1-128] of boolean; 
temp : S3; 
result : SO; 
zero : integer; 

begin (* of printmpp *) 

zero >» 0; 
mask >• true; 

mask not(shifKmask,-l,zero) or ahiftCmask^ero,-!)); 
mask .*=■ shift (mask,-x+l,-y+l); 
temp » false; 
where mask do 
temp > matrix; 
result >■ or (temp.1,2): 
printmpp :» result; 

end; (* of printmpp *) 
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The Massively Parallel Processor is a SIMD computer with 
16384 processing elements connected la a 128 x 128 mesh. Such 
an organization is ideal for problems which Involve near neighbor 
iterations, but for other problems which Involve other data map- 
pings it is often considered to be inefficient. In this paper a general 
algorithm for implementing arbitrary permutations and mappings 
on such systems is presented. Efficient matrix rotation- algorithms 
based on this permutation function are also rfisniweri. Nearest 
neighbor, bilinear interpolation and bicubic spline interpolation 
schemes are considered. These algorithms are extended for the case 
when the matrix to be processed is larger than the parallel 
hardware dimensions. 


INTRODUCTION 

A convenient way to interconnect a very large number of 
processors is in a two dim.nnnnai grid or mesh; this interconnec- 
tion arrangement is very simple to implement, has a cor which 
increases linearly with the number of processors and is very suit- 
able for a large number of algorithms. An example of such a sys- 
tem is the Massively Parallel Processor Cl] which involves 16384 
bit-serial proce s sors organized in a 128 z 128. The MPP is pro- 
grammed in a high level language called Parallel Pascal (2). 

The only permutation function which is directly imple- 
mented by the MPP is the near neighbor rotate (or shift). The 
direction of the rotation may be in any of the four cardinal direc- 
tions. in Parallel Pascal the main permutation functions are 
multi-element rotate and shift functions; other permutations are 
built on these primitives. 

The rotate function takes as arguments the array to be 
shifted and a displacement for each of the arrays dimensions. For 
example consider a one dimensional array a specified by 

a. array [0_a] of integer; 

The rotate statement 
a. rotate( a. 3): 
is equivalent to 

for i > 0 to n do 

<a£i] >• af(» + 3) mod (n + 1)1 

The rotation utilizes the toroidal end around edge connections of 
the mesh. The shift function is similar except that the mesh is 
not toroidaily connected and zeroes are shifted into elements at the 
edge of the array; therefore, the shift function is not a permuta- 
tion function in the strict sense. The concept of the rotate and 
shift functions extend to n dimensions on the MPP the last two 
dimensions of cha array correspond to the parallei hardware 
dimensions and are executed in parallei, higher dimension opera- 
tions are implemented in serial The cost of me rotate function is 
dependent on the distance rotated. It also depends on die size of 
the data elements to be permuted. 

There is ao simple known way to decompose in arbitrary 
permutation into a minimal sequence of operations on an MPP like 
system. In this paper a heuristic algorithm is described. The algo- 


rithm exploits the local consistency of data which occurs in many 
practical applications; An effective application of this algorithm to 
matrix rotation is presented. For some permutation* such as the 
perfect shuffle, which do not directly exhibit this consistency pro- 
perty, the algorithm may not be very effective. 

For many applications the physical dimensions of the parallel 
hardware are smaller than the dimensions of the array to be pro- 
cessed. in this case the data array is processed as a set of blocks. An 
extension of the p e rm utation algorithm to deal with this situation 
is discussed. 

The pro gra m and algorithm examples given in this paper use 
the Parallel Pascal notation. This notation involves three exten- 
sions to standard Pascal: 

1) expressions involving whole arrays are permitted; 

2) the where - do - otherwise control statement is available. 
This statement is a parallel version of the if - then - else 
statement; the control expression must evaluate to a Boolean 
array. All array assignments within the controlled state- 
ments must be conformable with the control array and are 
masked by in 

3) the functions any and min are the array reduction functions 
or and minimum respectively. 

MATRIX PERMUTATIONS 

The matrix permutation algorithm presented in this paper is 
a general algorithm for implementing arbitrary p er m utations of a 
two dimensional matrix on mesh connected parallel processors. It 
is also capable of performing any onto mapping. It uses a heuristic 
ap p r oa ch to reduce the execution time. 

The permutation of a matrix a is specified by two coordinate 
matrices c and r which have similar dimensions to a. The per- 
muted matrix 4 also has the same dimensions as a. For a matrix 
element the corresponding elements rfi.il and cfi.il specify the 
row and column indices respectively of where the related element 
of a is located. That is, the permutation is specified by 

Sfi.il > <4 Ki.il cCi.il ] 

More formally, the data arrays involved in the permuted, a 
are specified byt 

aj> : array U-irow.l-acoi] of data; 

(where data is any base cype] 
r : array (Lmrow.l -ncoil of l-nrow; 
c : array (i-arow,i_acoii of l-ncot 
In order to compute the relative distance that the data must 
be moved, two pixel element identifying matrices idr and i dc are 
precomputed. They contain the following: 

ifrfi.il i: 
idcf i.ii > i 
for ail i,i. 
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The relative riimmw- to be moved an then specified by 

rr ( r-idr ) mod nrow; 

re y- (c -ide) mod ncaii 

In a permutation the data may be shifted in any of the four 
quadrana in order to reach a specified destination. However, in 
the following algorithms only positive data shifts are considered, 
Le. in the up and left directions The other three quadrants are 
covered by using modulo arithmetic for shift distance calculations 
and implementing data movement with the rotation function 
which utilizes the end around mesh connections We have investi- 
gated a modified heuristic algorithm which checks ail four qua- 
drants and moves in the optimal direction. Fewer data shift opera- 
tions are required but the overhead due to checking alternative 
directions is significantly higher. 

A simple permutation algorithm 

A simple naive algorithm to achieve an arbitrary permuta- 
tion is to slide a over ail the possible positions of 4, assigning the 
specified elements of a to each element of b when they are in the 
correct position. 

for c- l to nrow do 
begin 

for jtm 1 to ncai do 
begin 

where (rr - i) and (re - j) do 
4 at 

a r- rotateCa. 0. li 
end; 

a rotateCa, 1. (ft 
end; 

This algorithm involves OGt 2 ) operations for an a X n 
matrix. 


The Heuristic algorithm 

In many permutations which occur in practice there are well 
defined patterns for the data. For example, near neighbor shifts are 
trivial with complexity 0(lX perfect shuffles can be implemented 
m 0 (a) time. The heuristic algori thm attempts to taka advantage 
of the fact that rr and re will be the «am« or similar for many 
elements. This is particularly true for operations such as matrix 
warping. 

The algorithm first slides (rotates) a as many locations up and 
left as passible such that future backtracking will not be neces- 
sary. If any element of a is correctly positioned over b (is. (rr - 
0) and (re - 0)) then 4 is updated. Otherwise, arr, which is a copy 
of the current version of a, is slid in the upwards direction until 
ill outstanding elements of 4, for which the current rc - 0, are 
satisfied. The algorithm then shifts as far as possible up and left 
again and repeats the above procedure until all elements of the 
result mask are false, ie. 4 is complete. 

The following variables are used in the algorithm ; 

Variable declaration 

maskjnasktr : array {Unrcw, t_ncoi] of boolean; 
arr : array (l_nrow, l-octxi of data; 
rrt : array [ljt row, l_ncoll of 0-nrow, 
ri, ric, lastrit : Q-nrow, 
d : O-tcoti 

Variable functions: 

mask : the result mask, true values indicate elements of 4 
which have aot yet received the correct element of a. 


ri, d : row and column distances for the up-left move. 
masker : a version of mask to process one column. 
rit : a version of ri used to process one column, 
arr t a version of a used to p ro cess one column, 
rrt : a version of rr used to procea one colu mn. 
lastrit : the last value of rit 

The Parallel Pascal version of the heuristic algorithm is as 
follows 

lastrit 
4 > a; 

mask (rr O 0) or (re O 0); 
while anyOnask, 1 J) do 

begin { iterate until the permutation is complete ) 
ri > minCrr, 1, 2); 
d > tninCre, 1, 2): 

a a- rotateCa, ri, d); { move up end left as far as possible ) 
rr :« rr - ri; 
re rc - d; 

masker (rr — 0) »nd (re - 0); 
if anyOnoxkrr, L 2) then { satisfy elements for the 
current position ) 

arr f- a 
else 

begin (satisfy each element for the given column) 
when rc- 0 do 
rrt> rr 
otherwise 
rrt nrow, 
rit > minC rrr, 1, 2); 
masker r- n-r - rie, 

{ the next seven statements implement ) 

( me statement aer - rotate (a. rit, 0) ) 

( but also take advantage of the previous shifts I 
if d o 0 then 
begin 
arr » a: 
lastrit r- 0; 
end; 

arr rotate( arr, rit - lastrit, 0 ): 
lastrit a- rit; 
end: 

where masktr do (update b for the currant location of a) 
begin 
4 arr; 
rr > nrow; 
rc 5- ncofi 
mask >• false; 
end; 

end; 

This algorithm is bounded by a 2 iterations. However, this 
must be considered a loose bound since we currently do not know 
a permutation which would require all n 2 iterations. The algo- 
rithm requires one iteration for a positive single element shift per- 
mutation but a-1 operations for a negative shift since the rotate is 
in the wrong direction. 


Algorithm Cost 

The cost of the naive algorithm is proportional to the number 
of rotate operations. Lee n 2 * ( the cost of a one element rotate 
operation plus two comparison operations ). The heuristic algo- 
rithm has two major cost components : the rotate operations as 
noted before and the ( mini reduction function*. He reduction 
functions are 'used to compute the multi-element distance for 
move* In the tables for the perform a n ce of the algorithm, both 
the total number of element rotates and the total number of 
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reduction operations are given. 

The relative con of a rotation and reduction is both system 
and data size dependent. For the MFP, the cost of a reduction 
function is in the order of 42 its whereas the single element rota- 
tion of 32-bit data requires in the order of 22 ns to 9.6 its 
depending upon the number of. successive rotate operations. There- 
fore, the reduction functions may represent a significant portion of 
the computation cost. With careful low level prog ra mming the 
reduction operations can be overlapped with data rotate operations 
such that their effective cost is in the order of L 4 its. If the MPP 
was augmented with a small amount of additional hardware simi- 
lar to that outlined in [3] then the reduction time could be 
reduced to U its over half of which could could be overlapped 
with data rotation operations. The heuristic algorithm always 
requires lea iterations and rotations than the naive algorithm; 
however, the additional overhead of the reduction function may 
make it less efficient in some instances. 


Permutation Results 

The results of some permutations performed in order to 
obtain rotated matrices of size 32 z 32 are given in Table 1 and EL 
These rotations are into mappings rather than permutations (see 
the matrix rotation section for details). For comparison, the oaive 
algorithm requires 1024 iterations, 1024 rotate operations and zero 
reductions for any 32 z 32 matrix permutation or mapping. Table 
in contains the results for perfect shuffie permutations for 
different size matrices. The result for perfect and inverse shuffiea 
are identical for any matrix size. 


TABLE t Cost for a near neighbor rotation on a 
32 z 32 matrix centered at 16 16. 


Angle of rotation 

Matrix rotation mapping cost 

iterations 

rotations 

reductions 

0 

0 

0 

0 

15 

124 

562 

340 

30 

262 

620 

748 

45 

505 

837 

1464 

60 

741 

1022 

2163 

75 

724 

1019 

2119 

90 

52S 

1007 

1552 


TABLE It Cost for a near neighbor rotation on a 
32 z 32 matrix centered at 1 1. 


The perfect shuffie is an example of a permutation which 
does not exhibit the locality property. The number of algorithm 
iterations needed to implement shuffles directly is (n — I f. How- 
ever. the separability property of the two dim»n<inn»i shuffle is 
not being used. If we use the permutation algorithm to the per- 
mutation in two stages; La. first shuffle the rows and then shuffle 
the columns, then n — 1 iterations are needed for each permuta- 
tion. Therefore, the perfect shuffle when implemented directly 
has complexity 0(s 1 3 ) , but when computed in two stages the algo- 
rithm is much more effective and has 0(n) complexity. The 
results of implementing the perfect shuffle as two separable 
shufflee are also given in Table 1IL Teble IV shows the results of a 
random permutation: this demonstrates that the heuristic is not 
effective when the mapping does not posses the locality property. 


TABLE IV: Permutation cast for a random permutation. 


matrix size 

Random Permutation 

iterations rotations reductions 

32 x 32 

630 995 1851 


LARGE ARRAYS 

Frequently the data to be processed by a parallel p rocessor 
will be in the format of arrays which exceed the fixed range of 
parallelism of the hardware. Therefore, it is necessary to have 
special algorithms that will deal with large arrays by breaJting 
them down into blocks manageable by the hardware, without loos- 
ing track, of the relationships between different blocks. 

One scheme, which is frequently used on the MPP. is to par- 
tition the large array into blocks which are conveniently stored in 
a four dimensional array. The range of the first dimension of this 
array specifies the number of blocks in each row of the large 
matrix and the range of the second dimension specifies the number 
of blocks in each column. Given a conceptual large matrix 

mx ■■ array [0-x.O.y] of btype: 

which is to be stored in an array c of type 

array [l_n. l_m. Up, l_qj of btype: 

Element i.y of the large matrix is mapped into the array a as 
specified by 

msfi.jj - a (l+i dtr p, l+j dir q, l+» mod p, 1+/ mod q] 


Angle of rotation 

-Matrix rotation mapping cost 

. 

iterations 

rotations 

reductions 

0 

0 

0 

0 

15 

235 

304 

666 

30 

437 

517 

1266 

45 

625 

683 

1825 

60 

779 

829 

2292 

75 

889 

956 

2631 

90 

993 

992 

2946 


For example, a 512 X 256 matrix could be stated in eight blocks ax 
la : array [X-4J_2J_:2S,1_128] of reaJt 

Thjj data structure allows blocks to be manipulated indepen- 
dently. However, it still preserves the positional relationships of 
those blocks in ths original large matrix. 

To simplify the manipulation of large arrays on the MPP, 
two Parallel Pascal library functions Irocat * and [shift have been 
developed. These functions take an array argument and two dis- 


TABLE Ht Cost for perfect shuffle permutations 
for different matrix sizes. 


matrix size 

Direct Shuffle Cost 

Separable Shuffle Cost 

iterations 

rotations 

reductions 

iterations 

rotations 

reductions 

4x4 

9 

12 

22 

6 

6 

6 

8 X 8 

49 

56 

134 

14 

14 

14 

16 x 16 

225 

240 

646 

30 

30 

30 

32x32 

961 

992 

2822 

62 

62 

62 


i 

I 
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piacemeat arguments, liica the prumuve matrix rotate and shift 
functions, however, in this case the array argument is a four 
dime n sio n al array which is treated like a conceptually large 
matris. 

Many programs can be converted to operate on blocked 
rather than conventional matrices by simply replacing ail 
instances of rotate and shift with [rotate and [shift respectively. 
This is true for the permutation programs presented: however, in 
the case of the heuristic permutation algorithm, this is not a very 
efficient solution. A better method is to scan through the result 
blocks and perform permutations on only the input blocks that 
contribute to the current result block being processed. This algo- 
rithm is shown below. 


TABLE Vt Comparison of perm2 and large blocked permutation 
for a rotation centered at coordinates 16 16. 


angle of rotation 

Perm2 

Large perm. 

rotations comparisons 

rotations 

comoarisons 

0 

nas 

184 

3472 

1552 

15 

850 

374 

761 6 

7552 

30 

ion 

766 

11520 

15408 

45 

1448 

1503 

16544 

25232 

1 50 

1858 

2174 

21328 

33552 

75 

1846 

2151 

25056 

37760 

' 90 

1793 

2001 

22544 

33120 


▼ar 

laMr. array (1 ,.n,l--n.l-a.row.l_acoll of data; 
Iria array (l-aU-m.i_arow,l-ncoi] of index: 


begin 

for i - 1 to n do 

for / - 1 to m do 

begin (process each result block! 
rb »■ 1 + Irfe/J dir nnwh 
cb » 1 + fcfe/J dir neofc 
ro 1 + Irfe/j mod nrvw, 
co » 1 +• itfe/j mod ncofc 
for k - 1 to n do 
for t « 1 to m do 

begin (consider each input block! 
maskb (rb . Ic) and (c6 - 0: 
if aayOmuki, 1. 2) then 
where maskb do 

life/] r* perm2 (la [fci] . 
ro, co, nutskbY, 

end; 


end; 


end; 


Perm 2 is the heuristic algorithm presented previously with 
the modification that tfre initial value is p*«wd as an argu- 
ment. That is, only elements selected by the mash are permuted. 
An additional speedup is achieved by this since the heuristic 
works much better when only a subset of elements are to be per- 
muted. 

Table V contains the results of the rotation mapping for the 
case where a 32 x 32 matrix is considered co consist of 4 x 4 blocks 
of 3 x 3 elements. The Large Perm results are from using the [ro- 
tate approach and the perm 2 results are for the block scanning 
algorithm. 


TABLE V: Comparison of per m2 and large blocked permutation 
for a rotation centered at coordinates 1 1. 


angle of rotation 

Perm2 

Large perm. 

rotations 

reductions 

rotations 

reductions 

0 

13 

3 

0 

0 

15 

657 

703 

6768 

7184 

30 

1050 

1302 

7616 

10704 

45 

1389 

1834 

8352 

13152 

60 

1702 

2329 

6896 

11248 

75 

1931 

2669 

4112 

5808 

90 

2086 

2971 

1856 

1408 


MATRIX ROTATION 

One application of the perm utation function is matrix rota- 
tion mapping. Three rotation r>eHniqu»« are considered : nearest 
neighbor, bilinear interpolation and bicubic interpolation. A rota- 
tion is specified by three parameters : the location of the origin of 
rotation (roCo) and the rotation angle 3 . The starting point of all 
rotation algorithms is the generation of the mapping matrices r and 
c from these parameters. 

Nearest neighbor 

The nearest neighbor algorithm is simply an into mapping in 
which a result element is assigned the value of the nearest rotated 
matrix element. In this case the new row and col umn coordinate 
matrices, r and c. are defined as follows; 

r[i,j\ - round( (c 0 * /)svi(0) + (i - r o )cor(0) + r 0 ) 

c[ij] -round((/ - coX».r(0) + O' — ro)sin($) + c 0 ) 

for all i and any values of the result which have near neighbors 
outside the range of the input matrix are set to zero. 

la performing a rotation, some elements of the result matrix 
are rotated and some elements are selected which are outside of 
the input matrix. In our algorithm result elements for the latter 
case are simply set to zero. Therefore we have a permutation in 
which a subset of the input elements map into a subset of the 
result elements; the size of these subsets depends upon the angle 
and origin of the rotation. The rotation is achieved by the 
valid elements of r and c with the heuristic permutation algo- 
rithm. 

Examples of nearest neighbor rotation are shown in Fig. 1. 
for an 3 x 3 matrix with the origin of rotation located at (l, l). 
For a small angle of rotation moot of the results have valid 
mapped values however, the heuristic algorithm is very effective 
for this case because there is little movement of the data or rr and 
cr are the same for many elements. 

When the rotation angle is large there is a lot of data rear- 
rangement but only a few elements are to be moved with the 
rotation located at (1, lX For the naive algorithm 49 iterations are 
needed for all rotations. 

Bilinear Interpolation 

For the bilinear inter polation algorithm a result element is 
computed from a weighted sum of the four rotated input matrix 
elements which surround it. There are two possible approaches to 
implementing this scheme. First, we can compute four permuta- 
tions; each permutation acquiring one of the four neighbors for 
each element. This is called multiple permutation. 

The second method is to perform one permutation and then 
ieek :he local neighborhood of the rotated input matrix for the 
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Figure 1. Sample Nearest Neighbor Rotations 


other near neighbors The idea being that a local search will 
requite lea computation than four complete rotations ea penally 1 
when the angle of rotation is large. The local neighborhood of a: 
single rotated matrix does not contain a complete set of the near- 
neighbor elements of the input matrix; some are lost due to grid- 
spacing differences. A complete set can be guaranteed, however, if: 
we aiso include the local neighborhood of a slightly perturbed,: 
routed input matrix. This scheme is called the double permutation 
method; both routed matrices can be computed simultaneously: 
with a single execution of a slightly modified heuristic permuta- 
tion algorithm. 

If we map a result element back, to the input matrix it will! 
be surrounded by four elements PI - P4 as shown in Fig. 2. These: 
points are moved to the processing element associated with the: 
result P and the interpolation is then computed in parallel. 

cf 

PI « , « P2 

I I 

if I I 

I * P 

P3 * « ?4 


Figure 2. Bilinear interpolation 


For the interpolation algorithms, the matrices, rp and cp,. 
contain the actual locations of the routed elements. 

rpli.j] = (e» - jXtini) + (1 - roXcor 9) + r„ 
cpi‘,jl = (y — e<>Xeor9) + (i - r^Csind ) + c 0 

for all i and j. 

The coordinates of the near neighbors are as follows 

r„ p 3 j(co — ;Xlwt9) + (i — r 0 Xcor9) + r„ | 
c uft 3 jo ~ CoXcor9) + (i - r-ffsind) m c, j 


rtmmm 3 r n, + 1 

Crtj/v 3 C Uft + l 
The interpolation fractions arc 
rf =rp - r„, 
cf * ep - c uft 

Once the points PI - P4 have been obtained the interpolated 
result is computed as follows: 

P * (l — cf )*Cl — rf YP\ + (1 — rf Pcf 9 P2 + 

(1 - cf PrfP 3 +-c/*r/*?4 
The algorithm for the multiple permutation approach is as 
follows 


begin 


*• •* c Uft* f 

4 coefl " permio. r, cf. (value of top left neighbor 1 
«:»«+!; 

4 :» coefl * permCa. r o) * tr, ( value of top right 
neighbor I 
r :» r + 1 ; 

4 :• coet'3 • permCa. r, c) +• 4; ( value of bottom right 
neighbor i 
es»c - 1 ; 

4 > coef4 • permCa, r, c) * br, \ value of bottom left 
neighbor I 

end; 

Perm is the heuristic permuutioa function. The final result of 
rotating a is stored in the matrix 4. 

The double permutation approach uses a modified permuu- 
tion function which creates the following matrices 

4 [i.j] > a. Ki.il 

and 

d [i,j] > 4 Ki.il KliJ + 1 ] 
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where 

a is the original matrix 
4 is the rotated matrix 
d is the shifted rotated matrix 
r is the row coordinate matrix 

and 

c 15 t ht column coordinate matrix 

To avoid loosing the value of the center of rotation, when the ori- 
gin is located to the right of the matrix center, the matrix is 
shifted left and, therefore, in the equation denning matrix d we 
substitute a negative one for the constant one. 

The second step in the double permutation algorithm is a 
local search performed on both rotated and shifted rotated 
matrices in order to find all the values of the elements needed for 
the interpolation. The local search has a constant maximum cost 
for any size matrix. It therefore has an advantage over the multiple 
permutations approach, since every permutation in that approach 
will become more costly as the matrix size increases. 

For the worst case rotation angle (9 - 45), it has been deter- 
mined that a local search In a 5 x 5 window is sufficient to yield 
the values of all the elements needed to perform a bilinear inter- 
polation. The local search strategy implemented in our algorithm 
is a spiral search. The elements an selected by comparing their 
row and column coordinates to those needed. Once they match, 
their values can be obtained from the rotated matrix or from the 
rotated shifted matrix. 


Cubic Interpolation 

The cubic interpolation version of the rotation algorithm is a 
simple extension of the bilinear interpolation scheme. The first 
step, finding the coordinate matrices r and c. is identica l to the bil- 
inear interpolation case. After obtaining these matrices^ the values 
of sixteen neighbor points must be acquired. If the multiple per- 
mutations approach is used, then sixteen separate permutations 
will be required. However, with the double permutation scheme 
only a small extension of the bilinear algorithm is heeded. 

Instead of using a 5 x 5 window, which is the case when 
four points have to be found, a 7 x 7 window is necessary to find 
sixteen points. However, since the element will have rotated in a 
specific direction, the search window can be reduced to a 7 x 5 
window. Each row of paints needed will use a different 7x5 
window of search. 

Once all the values needed are found, the bicubic interpola- 
tion itself is done by. first, pe r f o r m ing a cubic interpolation for 
each of the four rows and. chen. performing a fifth cubic interpo- 
lation on the row points obtained. 

As shown in Fig. 3, the reference point for the cubic interpo- 
lation computed in step one is P& The first four cubic interpola- 
tions are performed to obtain points pa, pb, pc and pd. The fifth 
one yields the value of point P. 


PI 

P2 

+pa P3 

P4 

P5 

P6 • — 

( 

- -pb P7 

PS 


1 

1 

\ 

' P 


P9 

P10 * 

+pe Pll • 

P12 

?13 • 

P14 • 

+pd P15 * 

P16 


Figure 3. Cubic Interpolation 


Test Results 

The local search perform ed in the double permutation algo- 
rithm baa a constant mmimnm cost for any matrix For any 
matrix the maximum cast is 100 rotations and 25 reductions for 
the bilinear interpolation method and 552 rotations and 525 com- 
parisons for the cubic interpolation. 

The results for rotations bilinear interpolation for a 32 
x 32 matrix are given in Table VQ and Vm for different centers 
of rotation. The results of rotations applying cubic interpolation 
are given in Tables IX and X. 

TABLE VR Cost of bilinear interpolated rotation 
centered at coordinates 16 16 


ingle of rotation 

Double perm. 

Multiple perm. 

rotations 

reductions 

rotations 

reductions 

0 

1188 

184 

4092 

649 

15 

850 

374 

2488 

1416 

30 

1011 

766 

2544 

2978 

45 

1448 

1503 

3292 

5916 

60 

1858 

2174 

4084 

8602 

75 

1846 

2151 

4081 

8491 

90 

1793 

2001 

4090 

7920 


TABLE VTU: Cost of bilinear interpolated rotation 
centered at coordinates 1 1 


angle of rotation 

Double perm. 

Multiple perm. 

rotations 

reductions 

rotations 

reductions 

0 

13 

3 

4 

6 

15 

657 

703 

1342 

2705 

30 

1050 

1302 

2035 

5093 

45' 

1389 

1834 

2683 

7238 

60 

1702 

2329 

3291 

9230 

75 

1931 

2669 

3471 

10627 

90 

2086 

2971 

3968 

11780 


TABLE IX; Cost of cubic interpolated rotation 
centered at coordinates 16 16 


angle of rotation 

Double perm. 

Multiple perm. 

rotations 

reductions 

rotations 

reductions 

0 

1640 

684 

16368 

2608 

15 

1302 

874 

10072 

5757 

30 

1463 

1266 

10394 

12021 

45 

1900 

2003 

13300 

23668 

60 

2310 

2674 

16341 

34371 

75 

2298 

2651 

16315 

33928 

90 

2245 

2501 

16360 

31664 


TABLE X; Cost of cubic interpolated rotation 
centered at coordinates 1 1 


1 

angle of rotation 

Double perm. 

Multiple perm. j 

rotations 

reductions 

rotations 

1 

reductions i 

0 

552 

525 

272 

30 | 

15 

1109 

1203 

7683 

11132 

30 

1502 

1802 

9723 

20589 I 

45 

1841 

2334 

11815 

29104 | 

60 

2154 

2829 

13793 

37047 

75 

2383 

3169 

15316 

42586 ! 

90 

2538 

3471 

15872 

47088 | 
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RESULTS USING THE MPP 

The results for 32 J 32 matrices reported in this paper were 
obtained with a Parallel Pascal Translator which translates Paral- 
lel Pascal into standard Pascal for program development [4], Some 
of these functions have also been run on the MPP; in this case for 
a 128 x 128 array, [n Table XI and Table XU results for near 
neighbor rotations are given and in tables XIH and XIV results for 
bilinear interpolation rotations are given. 

Table XL Cost for a near neighbor rotation on a 
128 x 128 matrix centered at 1 1. 


Angle of rotation 

Matrix rotation mapping cost 

rotations 

reductions 

0 

0 

0 

15 

5114 

7506 

30 

S495 

14046 

45 

11103 

20050 

60 

13504 

25121 

75 

15516 

28802 

90 

16256 

32385 


In Figs. 4 and 5 comparisons are given between the 32 x 32 
and 128 x 128 results. The cost shown is the ratio of the number 
of shift operations required by the heuristic algorithm over the 
number of shifts required by the simple algorithm for a single 
permutation (ix, n 2 ). These ngures show that there is a very good 
correspondence between the the results for the different size 
matrices. That is, for the rotation algorithm the improvement 
achieved with the heuristic algorithm is a constant which which is 
independent of matrix size. 


Table XIL Cost for a near neighbor rotation on a 
128 x 128 matrix centered at 64 64. 


.Angle of rotation 

Matrix rotation mapping cost 

rotations 

reduction* 

0 

0 

0 

15 

9183 

3594 

30 

10446 

7800 

45 

13589 

15777 

60 

16378 

23359 

75 

16375 

24331 

90 

16319 

16384 


Table XLh Cost for a bi line ar interpolation rotation on a 
128 X 128 matrix centered at 1 1. 


Angle of rotation 

Matrix rotation mapping cost 

rotations 

reductions 

0 

101 

25 

15 

9009 

7555 

30 

15689 

14107 

45 

21246 

20034 

60 

26228 

21192 

75 

30023 

28893 

90 

32614 

32410 



Figure 4. Ratio of the number of rotations to n~ rotations for near neighbor rotation with the 
center of rotation at (1,1) and at the matrix center. 
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Table XTV: Cost for a bilinear interpolation rotation on a 
128 x 128 matrix centered at 64 64. 


Angle of rotation 

Matrix rotation maoping cast 

rotations 

reductions 

0 

101 

25 

15 

11510 

3628 

30 

14648 

7826 

45 

21556 

15S00 

60 

28282 

23378 

75 

28763 

24382 

90 

24676 

16409 


CONCLUSION 

An effective heuristic algorithm for arbitrary permutations 
and data mappings for mesh connected SIMD processors has been 
presented. This algorithm is particularly suited to the following 
conditions 

1. When only a few elements are to bo moved. 

2. When many elements share a similar motion, e-g. small angle 
matrix rotation and warping. 

1 When large arrays are to be processed. 


It is less suitable when the permutation, or mapping is dense 
and does not have the locality property. The effectiveness of this 
algorithm over a naive algorithm depends upon the system imple- 
mentation parameters, and the si ze of the data to be manipulated. 

An effective technique for matrix rotation interpolation has 
been presented which involves a local search sche me. Excellent 
results have been obtained especially for bicubic interpolation. 
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Parallel Pascal: An Extended Pascal for Parallel Computers 

Anthony F. Reeves 

School of Fleclrkal Engineering. Cornell University . Ithaca, New York 14853 

Parallel Pascal is an extended version of the convenlional serial Pascal pro- 
gramming language which includes a convenient syntax for specifying array oper- 
ations. It is upward compatible with standard Pascal and involves only a small 
number of carefully chosen new features. Parallel Pascal was developed to reduce the 
semantic gap between standard Pascal and a large range of highly parallel computers. 

Two important design goals of Parallel Pascal were efficiency and portability. Por- 
tability is particularly difficult to achieve since different parallel computers frequently 
have very different capabilities. 

I . Introduction 

There is a large class of mainly scientific problems which require Ihe 
ivailability of highly parallel processors in order to compute results in a 
easonable amount of time. Many diverse parallel computer architectures 
tave been designed for these problems, which usually involve either pipeline 
ir processor array schemes. A common feature of these architectures is their 
ibility to perform some very high speed operations on arrays of data. How- 
ever, there is a very large variation between different architectures in the set 
if array operations which can be efficiently implemented and the size of the 
trrays which can be efficiently processed. The term “parallel computer" in 
his paper is used to indicate systems which have high-speed array-processing 
:apabililies through hardware parallelism. Parallel Pascal is a programming 
anguage for such parallel computers. 

Conventional serial high-level programming languages are difficult to im- 
ilement efficiently on parallel computers. Most parallel computers are cur- 
ently programmed in either assembly language or a machine-dependent 
ipecial version of Fortran. The main advantages of a general high-level 
anguage for parallel computers, such as Parallel Pascal, are portability, 
>etter error detection and diagnosis facilities, and efficiency. Efficiency must 
>e a prime consideration since with any parallel system extra hardware is 
>eing used to achieve a high-speed performance. Portability is perhaps the 
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most difficult goal to achieve while maintaining efficiency since different 
parallel systems have very different data permutation capabilities. 

There are three fundamental classes of operations on array data which arc 
frequently implemented as primitives on array computers but which are not 
available in conventional programming languages; these are: data reduction, 
data permutation, and data broadcast. These operations have been included as 
primitives in Parallel Pascal. 

Data reduction operations reduce the number of elements in Ihe data by 
applying an operator between array elements and returning the residt. For 
example, a typical reduction operation is to compute a vector which contains 
the sums over the rows of a matrix. Parallel Pascal reduction functions arc 
described in Section 4. Data selection may be considered to be a special type 
of reduction function; however, selective data assignment requires a different 
syntactic structure. Data selection and selective data assignment use a similar 
syntactic form in Parallel Pascal and are described in Section 3. 

Data permutation and data broadcast functions rarely need to be used in 
conventional programming languages since random access to array elements 
is usually adequately dealt with by index expressions. In a parallel system it 
is usually much more efficient to do a data permutation in parallel. Data 
permutation and broadcast functions are described in Section 5. 

A single parallel control statement, the “where” statement, is defined in 
Parallel pascal. It is similar to an “if” statement but with an array control 
variable. The where statement is described in Section 6. A method of acces- 
sing the individual bits of array data elements, which is possible with bit- 
serial parallel computers, is outlined in Section 7. 

Parallel Pascal was originally designed as a high-level language for 
NASA’s Massively Parallel Processor (MPP), which was constructed by 
Goodyear Aerospace [IJ. The MPP has a single-instruction unit and 16,384 
processing elements organized in a 128 x 128 matrix with near- neighbor 
interconnections. Other parallel computers with a similar interconnection 
scheme include Illiac IV and the DAP. Parallel Pascal is also suitable lor 
parallel computers with a less restrictive interconnection scheme such as is 
found with many pipeline systems, for example. In this case, it may be 
necessary to implement some additional data-mapping functions in order to 
take advantage of all the parallel computer’s capabilities. This is discussed 
. further in Section 5.3. An in-depth description of Parallel Pascal and oilier 
aspects of the MPP is given in [2] and also in |3|. 

A Parallel Pascal-to-standard Pascal translator has been developed to allow 
- initial experimentation with different language features. This translator is 
now being used for program development of Parallel Pascal programs on 
conventional serial computers. 

A compiler has been developed {3, 2] which converts a Parallel Pascal 
program into a parallel P-code form. Interesting features of this P-code 
language are outlined in Section 8. A parallel P-code code generator for the 
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MPl’ has been developed by Computer Sciences Corporation. Parallel Pascal 
is the first operational high-level language for the MPP. 

A considerable effort was made to maintain the concepts of standard Pascal 
in designing the extensions for Parallel Pascal. However, there were several 
features of standard Pascal which cause problems in the parallel computer 
context. These are discussed in Section 9. 

2. Parallel Expressions 

In standard Pascal the only aggregate array operation defined for arrays is 
to copy one array to another. For example, given the definition 

var a, b, c: array [I..I0| of integer; 

the following statement is valid and means copy all elements of array b to 
array a. 

a : = b\ 

In Parallel Pascal all conventional expressions are extended to array data 
types. In a parallel expression all operations must have conformable array 
arguments. A scalar is considered to be conformable to any type of com- 
patible array and is conceptually converted to a conformable array with all 
elements having the scalar value. For example, the statement 

a : — b + c + I; 

is equivalent to 

for f : = 1 to 10 do 

oli'l :=b[i) + c[/J + I; 

In many highly parallel computers there are at least two different primary 
memory systems; one in the host and one in the processor array. Parallel 
Pascal provides the reserved word parallel to allow programmers to specify 
the memory in which an array should reside. In standard Pascal an array type 
is specified with the syntax 

type newtype = array [indextype] of eltype; 

where indextype specifies the number and range of the array dimensions and 
eltype specifies (he type of the array elements. A parallel array type is 
specified with the syntax 


type newtype = parallel array (indextype] of eltype; 

The parallel specifier exists only to provide information to the compiler as to 
the variables’ usage. In all usage in the language a parallel array is indistin- 
guishable from a conventional array. In some systems there is no distinction 
between host and processor memories; then the parallel specifier has no 
effect. In any case, a compiler may decline to store the array where requested. 

3. Array Selection 

Selection of a portion of an array by selecting cither a single index value 
or all index values for each dimension is frequently used in many parallel 
algorithms, e.g., to select the ith row of a matrix which is a vector. 
Specification of a single index value is the standard indexing method in 
standard Pascal. In Parallel Pascal all index values can be specified by eliding 
the index value for that dimension. For example, given the definition 

var a,b: array [I.. 5, l..l()| of integer; 

in Parallel Pascal the statement 

fll.U := b[, 41; 

assigns the fourth column of b to the first column of a. It is interesting to note 
that standard Pascal already permits the assignment of whole arrays and of 
subarrays when the rightmost dimensions have been completely elided; there- 
fore, the following are valid statements in standard Pascal: 

a : — /;; 

The second statement means assign the second row of b to the first row of a; 
in Parallel Pascal this could also be specified by 

olU := *|2,1; 

3.1. Subrange Constants 

It is sometimes necessary to move data between arrays with different 
dimensions. In Parallel Pascal subarrays consisting of consecutive sets of 
elements may be specified. If subarrays with other than consecutive elements 
are required then they must be packed into the consecutive form with per- 
mutation functions. The concept of a constant subrange is introduced in order 
to specify a consecutive subset of index values. 
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The syntax for the constant subrange is 

const identifier = low.. high; 

where low and high are either literals or previously defined constant 
identifiers. 

3.2. Subrange Indexing and Array Packing 

Subrange constants may be used to index an array in Parallel Pascal. The 
general syntax for a subrange index is 

anay-idenlifier|offset @ subrange-constant] 

where offset is an optional conventional scalar index expression. The ordered 
set of indices specified by a subrange index is the result of adding the value 
of (he offset expression to the values implied by the subrange constant. For 
example, given the definition 

var a, b : array {I .. 10] of integer; 

the statement 

ol@2..61 := ft|3 @ 1 . 5|; 

is functionally equivalent to 

for i ; = I to 5 do 

a\i + I] := b\i + 3]; 

The main reason for the introduction of subrange indexing was to permit 
blocks of data to be transferred between arrays having different dimensions. 
It was not designed to be a tool for algorithm development. An alternative 
specification of this operation, which was considered, was to use a new 
standard procedure, called replace, which is analogous in style to the standard 
Pascal pack and unpack functions. A possible syntax for replace is 

replaced, Oa\, Ra I, Oal, Ra2, .... ft, Obi, Rb 1, . . .) 

where Oal is the offset of the first dimension of array a, Ra I is the range of 
the first dimension of array a, etc.. 

This syntax works well for vectors but results in a proliferation of argu- 
ments for higher-dimensional arrays (four for each dimension). The more 
syntactically complex subrange indexing scheme was chosen because it was 
considered to be more readable for multidimensional arrays. 
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3.3. Array Canformability 

In standard Pascal, data items combined together in an expression must be 
type compatible. In Parallel Pascal, array data items in a parallel expression 
must also be conformable, i.e., have the same rank (number of dimensions) 
and (he same range in each dimension. For example, given the definitions 



a : = ft + r; 


is not conformable since the specified ranges of ft and c are different. 

While the exact range conformability requirement is in keeping with the 
strong typing concepts of standard Pascal, there are occasions when the action 
specified by the above statement is useful. The range requirement can be 
explicitly circumvented by using subrange indexing. For example, the 
statements 


a ft + c|@0..9|; 
fl|@l..l()] ;= ft|@l .101 + r; 
o|@l,.IO| := ft|(q»l..|0] + <|(q>0..9|; 

are all conformable and have the same effect. 


4. Reduct ion Funct ions 

Array reduction operations are achieved with a set of standard functions in 
Parallel Pascal which are listed in Table I. 

TABLE I 

Reduction Functions 

Syntax Meaning 

sum(array, Dl , D2 Dn) Reduce array willi arithmetic sum 

prod(array, Dl, D2, . . . , Dn) Reduce array with arithmetic product 

all(nrray, Dl, D2, . . . , Dn) Reduce array with Boolean AND 

any(array, Dl, D2, . . . , Dn) Reduce array with Boolean OR 

max(array, Dl, D2 Du) Reduce array with arithmetic maximum 

min(array, Dl, D2 Du) Reduce array with arithmetic minimum 
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Hie first argument of a reduction function specifies the array to be reduced 
and the following arguments specify which dimensions are to be reduced. A 
dimension is specified by a constant expression; the first dimension is dimen- 
sion I . The dimension parameters must be constant expressions so that the 
shape of the result is known at compile time, 
for example, given the the definitions 

var 

a: array(I..IO,l..5| of integer; 
b : array (1..I0J of integer; 
c: integer; 

the following are correct Parallel Pascal statements 

b : = sum(o, 2); (* sum the rows of a *) 

c := sum (a, I, 2); (* sum all elements of the array a *) 

c : = max(/>, I); (* find the maximum value of b *). 

Each dimension parameter of a reduction function implies that there will be 
one less dimension in the result array; a scalar is considered to be an array 
without any dimensions in this context. 


5. Permutation and Distribution Functions 

One of the most important features of a parallel programming language is 
the facility to specify parallel array data permutation and distribution oper- 
ations. In Parallel Pascal four such operations are available as primitive 
standard functions; however, for some Parallel Processors it may be necessary 
to specify more primitive functions for efficiency. The standard Parallel 
Pascal functions for data permutation and distribution are given in Table II. 


5.1. Shift and Rotate 

The shift and rotate primitives are found in many parallel hardware archi- 
tectures and, also, in many algorithms. The shift function shifts data by the 
amount specified for each dimension and shifts zeros (null elements) in at the 
edges of the array. Elements shifted out of the array are discarded. The rotate 
function is similar to the shift function except that data shifted out of the array 
are inserted at the opposite edge so that no data are lost. The first argument 
to the shift and rotate functions is the array to be shifted; then there is an 
ordered set of parameters, each of which specifies the amount of shift in its 
corresponding dimension. There must be as many shift parameters as there 
are dimensions in the array; the first shift parameter is associated with the first 
dimension of the array. 

For example, given the definitions 

var 

a, b : array (I..5, 0..9) of integer; 
c, d: array |0..9| of integer; 

the statement 

a : = shiftfb, 0, 3); 

is functionally equivalent to 

Tor i I lo 5 ilo 
begin 

for j : = 0 lo 6 do 
a\ij\ : = b [i,j + 3]; 
for j : — 7 to 9 do 
o\i,j) • = 0; 

end; 


TABLE 1! 

Permutation and Distribution Functions 


Syntax Meaning 

shifl(airay, SI, S2, . . . , Sn) End-off shift data within array 

rotatefarray, SI, S 2, .... Sn) Circularly rotate data within array 
transpose(array, D\, D2) Transpose Iwo dimensions of array 

ninaiuiLirrav dim. ranee) Expand array along specified dimension 


and the statement 


c : = rotate(r/, 3); 


is functionally equivalent to 

for i : = 0 lo 9 do 

/•III ^ =: //f ( » -4- niiwl ini- 
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5.2. Transpose and Expand 

While transpose is not a simple function to implement with many parallel 
architectures, a significant number of matrix algorithms involve this function; 
therefore, it has been made available as a primitive function in Parallel 
Pascal. The first parameter to transpose is the array to be transposed and the 
following two parameters, which are constant expressions, specify which 
dimensions are to be interchanged. If only one dimension is specified then the 
array is Hipped about that dimension. 

The main data distribution function in Parallel Pascal is expand. This 
function increases the rank of an array by one by repeating the contents of the 
array along a new dimension. The first parameter of expand specifies the array 
to be expanded; the second parameter, a constant expression, specifies the 
number of the new dimension and the last parameter; a subrange or a subrange 
type specifies the range of the new dimension. 

This function is used to maintain a higher degree of parallelism in a parallel 
statement; this may result in a clearer expression of the operation and a more 
direct parallel implementation. In a conventional serial environment such a 
function would simply waste space. 

For example, given the definitions of a, b, and c as specified in Section 5. 
the following statement adds a vector to all rows of a matrix 

a := b + expand(c, I, 1..5); 

The above statement is functionally equivalent to 

for i:— I to 5 do 
<i|i,] := b[i, 1 + c; 

5.3. Ollier Fund ions 

While the shift and rotate functions are adequate primitive functions for 
highly parallel computers such as the MPP, which have near-neighbor mesh 
interconnections, different primitive functions may be necessary for other 
architectures. For example, consider an architecture which can directly im- 
plement a perfect shuffle permutation and the algorithm to be implemented is 
the Fast Fourier Transform (FFT). If a perfect shuffle function is available in 
the high-level language then the FFT can be very clearly and efficiently 
specified, whereas a specification using only shift and rotate would be very 
difficult to write and the compiler would have to do a clever code optimization 
in order to achieve an efficient implementation. 

A solution which is consistent with the framework of Parallel Pascal is to 
define the shuffle as a primitive permutation function for this computer archi- 
tecture. Portability may be maintained by writing a library function for the 
shuffle using only shift and rotate primitives. In this way an algorithm written 


ORIGINAL ES 

- OF POOR QUAUTY 


PARALLEL PASCAL 


73 


architecture. The efficiency of such a ported algorithm may not be very high 
on the MPP and some reprogramming of the algorithm may be necessary. 

Given the wide diversity of parallel processor architectures it is perhaps 
unreasonable to expect that a single algorithm specification can be efficiently 
compiled for all systems. The scheme proposed above for Parallel Pascal 
permits simple porting of all algorithms to different architectures for 
verification and provides a consistent notation for reprogramming an algo- 
rithm to tune it for a particular architecture. This scheme is made possible 
because Parallel Pascal uses the conventional function syntax to specify 
permutation primitives. In this way the use of new primitive functions is 
syntactically indistinguishable from using a user-defined permutation func- 
tion. 

6. Conditional Execution 

An important feature of any parallel programming language is the ability 
to have an operation operate on a subset of the elements of an array. In 
standard Pascal each array element is processed by a specific sequence of 
statements and there are a variety of program control structures for the 
repeated or selective execution of statements. In Parallel Pascal the whole 
array is processed by a single statement; therefore, an extended program 
control structure is needed. 

In the initial specification for Parallel Pascal all the standard Pascal control 
structures were extended to accept an array control expression. These ex- 
tended control structures proved to be very difficult to exactly specify within 
the framework of standard Pascal. Furthermore, the need for the looping 
control structures is much less in an environment which allows parallel 
expressions. For example, the APL programming language, which involves 
a very powerful array expression capability, does not have a direct program 
loop control structure. In the test algorithms that were programmed only the 
extended if statement was frequently used. The other control structures were 
dropped from the language and the conditional execution statement was 
renamed where due to semantic differences with the standard if statement. 

The syntax of the Parallel Pascal where statement is as follows: 

where array-expression do 
statement 
otherwise 
statement 

where array-expression is a Boolean- valued array expression and statement is 
a Parallel Pascal statement. The otherwise and the second controlled state- 
ment may be omitted. 


74 


ANTHONY P. REEVES 


PARALLEL PASCAL 


75 


The execution of a where structure is defined as follows. First, the control- 
ling expression is evaluated to obtain a Boolean array (mask array). Next, the 
first controlled statement is evaluated. Array assignments are masked accord- 
ing to the Boolean control array. If there is an otherwise statement it is then 
evaluated; in this case array assignments are masked with the inverse of the 
control array. 

For example, given the definition 

vara, b, c: array |!..I0] of integer; 

the expression 

where a < b do 
c : = b 
otherwise 

c : = a; 

is functionally equivalent to 

for i := I to 10 do 
if ali'l < b[i] then 
clil := b|i] 
else 

c[i] : = a]/]; 

The main semantic difference between the where-do-othcrwise structure 
and the if-then-else structure is that with the former both controlled state- 
ments are evaluated, independent of the value of the control expression, while 
with the latter only one of the two controlled statements is evaluated. 

Where statements may be bested provided that all of the controlling array 
expressions are type compatible. Other standard Pascal control statements can 
also be nested within where statements. Any array variable which appears on 
the left-hand side of an assignment within a where-controlled statement must 
be type compatible with the controlling array expression. Assignments to 
other than array variables in a where statement are in no way affected by the 
where statement. The effect of a where statement is local to the procedure 
or function in which it occurs; that is, it does not affect the execution of any 
procedures or functions called from within a where statement or an otherwise 
statement. 

The where statement provides Parallel Pascal with a conditional assign- 
ment fnrilitv An alternative to conditional assignment which was not imple- 


mented in Parallel Pascal is conditional evaluation. In this case, all the 
operations are masked; only data elements which contribute to mask selected 
results are processed. This is conceptually more pleasing since only needed 
computations are specified, whereas, with conditional assignment, extra com- 
putations are conceptually performed which are discarded by the assignment 
operation. 

Conditional evaluation can also be useful for catching or avoiding excep- 
tional conditions. For example, the following statement will successfully 
execute if any element of a is zero and conditional evaluation ndes are obeyed 
but will cause a divide-by-zero error if conditional assignment is used since 
the reciprocal of a is computed for all elements. 

where a <> 0 do r : = I/a; 

While conditional evaluation provides some additional capabilities, it also 
introduces some semantic difficulties. The main problem occurs when an 
array expression is passed to a procedure or function. What values are passed 
for those elements for which the controlling expression is false? Another 
problem arises with the use of standard functions which alter the shape of 
arrays. At what point is the masking applied? 


7. Bit Plane Indexing 

A feature of several current highly parallel computers such as the MPP and 
the DAP is that arithmetic is conducted at the bit level rather than the word 
or number level. That is, the computer “word" or bit plane manipulated by 
these computers is a single-bit slice through all elements in the array being 
processed. 

Some algorithms can be made considerably more efficient for these com- 
puters if specified at the bit. plane level. Bit-plane indexing was added to 
Parallel Pascal to enable a programmer to conveniently specify most of these 
special algorithms without resorting to an assembly code subroutine. 

A bit-plane index is specified by the last item in an index expression and 
is separated from other indices by a colon. The result of a bit-plane-indexed 
array has a Boolean element type. For example, given the definition 

vara: array (I.. 5, 1. 10] of integer; 
varb: array |1.. 5, 1 . 10] of Boolean; 

then the statement 

b := n]:0]; 
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is equivalent to 

b : = odd(ri); 

The next example subtracts one from the selected array element if necessary 
to make it exactly divisible by 2. 

<i(3, 1:0] := false; 

The least significant or first bit-plane is always bit-plane 0. Programming 
with bit-plane indexing requires a knowledge of the internal number repre- 
sentation of the parallel processor and is a highly nonportable feature. Fur- 
thermore, bit-plane indexing on a processor which does not operate at the bit 
level is usually very inefficient. 


8. Parallel P Code 

The Parallel Pascal compiler consists of a syntax analysis “front end” and 
a code generation “back end.” These two phases of the compiler communicate 
through an intermediate language called Parallel P-code |4J. The compiler 
and the P-code are based on the P-4 Pascal Compiler |5]. 

Parallel P-code has the following extensions relative to standard P-code for 
parallel languages and computers. First, it provides a mechanism by which 
nonprimitive types may be specified. This is needed because parts of an array 
may be specified which do not have an explicit type declaration. For example, 
the column of a two-dimensional array may be selected and used within a 
parallel expression; however, this column has not been previously defined by 
an explicit type statement. 

Second, Parallel P-code provides an abstract addressing scheme for allo- 
cating and referencing automatically allocated variables. Ibis is done to 
permit optimizing stages and the code generator to determine the memory 
system in which a data item should reside. In many parallel computers there 
are at least two data memory systems, the host memory and the parallel 
computer memory. 

Third, Parallel P-code provides mechanisms for operating upon arrays, 
array subsets, and individual array elements. Fourth, it provides a symbolic 
mechanism for defining and referencing the fields of a record structure. This 
is needed because of the abstract addressing scheme of Parallel P-code. 
Finally, it facilitates conditional assignment, i.e., the where statement, by 
providing mechanisms for establishing, altering, and removing a Boolean 
mask array. A mask stack facility is supported for nested where statements. 
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9. Problems with Standard Pascal 

The initial Pascal language as designed by Wirlh is described in |6|; since 
then, the ISO standard Pascal has been specified, a very readable account of 
which is given by Cooper |7], and an 1EEE/ANSI standard has also been 
specified |8], There were two fundamental problems with standard Pascal 
which made it difficult to use in the parallel processor context. First, strong 
typing. °» e of Ihe prime features of Pascal, made it very difficult to write 
general array subprograms which could operate on more than one size of 
array. Second, there is no library facility or separate compilation facility in 
standard Pascal. However, a large number of support functions are needed for 
many of the anticipated applications. 

9. 1 . Strong Typing 

Strong typing is supported with varying degrees in standard Pascal. For 
example, the range of integers is well defined and may be redefined; however, 
for real numbers there is no specification of either range or precision. Array 
types are very strongly typed; two total arrays are only conformable if they 
share a common type declaration. This has to be relaxed in the confer inability 
of the where statement in Parallel Pascal. A controlling array must be of 
element type Boolean, while the controlled arrays may be of any clement 
type. If two arrays have different clement types then they must have com- 
pletely separate type declarations. 

A second fundamental problem is how to define a procedure or function 
which can deal with arrays of different sizes. The conformant array parameter 
described in the ISO standard (for Pascal level I), but not in the ANSI 
standard, is a very good solution to this problem, especially in the parallel 
environment. In this scheme, array type information is explicitly specified 
with each array parameter. A subprogram may be specified to accept several 
different types for an argument; however, all these types are known at com- 
pile time, permitting an efficient code to be generated for each of them. This 
may be especially important for some array computers where the algorithm 
for the generated code may depend upon the array dimensions. 

?.2. Library Subprograms and Separate Compilation 

Standard Pascal has no library facility; all subprograms, i.e., procedures 
ind functions, must be present in the source program. A library preprocessor 
was developed to allow the use of libraries without violating the rules of 
standard Pascal. The header line of a library subprogram is specified in the 
source program with an extern directive. The library preprocessor replaces 
'Jie extern directive with the appropriate subprogram body. The type informa- 
tion for the library subprogram is extracted from the declaration statement in 
Jie source program. Therefore, library subprograms can be written to work 
with any user-specified array type. 
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If a library subprogram is to be used for more than one array type in the 
same block, then a subprogram declaration statement for each unique argu- 
ment type is necessary. Each unique version of the subprogram is identified 
by a user-specified extension to the subprogram name in both declaration and 
usage. 

I 'nr example, consider the ceiling function as defined below: 

funefion ceiling(jr:xtype) : rtype; 
begin 

where x < 0.0 do 
ceiling : = trunc(jc) 
otherwise 

where jr-tnmc(jr) = 0.0 do 
ceiling : = truncfx) 
otherwise 

ceiling : = truncfr) + I ; 

end; 

The following program fragment illustrates how more than one version of this 
function could be specified for the library preprocessor. 

type 

ar = array (1..I0] of real; 
ai — array [I..10] of integer; 
br — array II.. 8, 1..8) of real; 
bi - array II.. 8 , 1..8] of integer; 
function ceiling. a(x :ar) :ai ; extern; 
function ceiling. b(x :br) :bi ; extern; 
var 

ax :ar, ay :ai ; bx:br, by\bi\ 

begin 
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ay : = ceiling. <i(o.r); 
by := ceiling. b(bx)\ 

The simple library preprocessor does not solve the separate compilation 
problem: all requested library subprograms must be recompiled whenever a 
change is made to the main program. However, it is an expedient solution to 
the library problem which will work with all Parallel Pascal compilers. 
External, partially compiled subprograms could be inserted at the P-code 
level or at the code generator level of a compiler. However, they should be 
inserted before the optimization stage so that specific parallel computer sen- 
sitivities to different array sizes may be considered. 


10. Conclusion 

A version of the Pascal programming language for parallel computers has 
been developed which requires very few new language features. One of the 
main features of this language is that permutations are achieved with con- 
ventional function forms. In this way it is simple to introduce new per- 
mutation functions for the efficient programming of a new parallel computer, 
when necessary, without changing the language. 

The obvious extension to standard Pascal was to allow operations on 
complete, and partially selected, arrays. Subrange constants were introduced 
to permit data to be transferred between arrays of different sizes. Several 
features were developed with current parallel computer architectures in mind. 
These include: the parallel specifier for systems with more than one data 
memory, conditional assignment instead of conditional evaluation, and bit 
indexing for computers with bit-serial arithmetic. 

The strong typing of arrays in standard Pascal was found to be a problem 
in some cases and had to be relaxed. Also, a library management capability 
was considered to be very important. One of the attractive features of standard 
Pascal is that as many decisions as possible are made at compile time rather 
. than at run time. This is a vitally important concept in any parallel language, 

• and has been maintained in Parallel Pascal, since runtime decisions on factors 
- such as array size can be very costly on parallel computers. 

* A version of the P-code intermediate language has been developed for 
Parallel Pascal. This contains two major changes from standard P-code. First, 
addressing is symbolic, rather than by direct memory offsets; this permits a 
code generator to select the best memory system for the data. Second, new 
operators have been introduced to deal with aggregate data structures. 
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Appendix D 

PARALLEL PASCAL AND THE MASSIVELY PARALLEL PROCESSOR 
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Ithaca, New York 14853 


INTRODUCTION . 

Parallel Pascal is an extended version of the Pascal programming language which is 
designed for the convenient and efficient programming of parallel computers. Parallel Pascal 
was designed with the MPP as the initial target architecture. It is the first high level pro- 
gramming language to be implemented on the MPP. 

The Parallel Pascal language is outlined in the first section of this chapter; then language 
restrictions on the current MPP compiler imposed by the MPP architecture are discussed. 
Finally, algorithm techniques for efficiently programming the MPP are presented. 

. Conventional serial high-level progr ammin g languages are difficult to efficiently imple- 
ment on parallel computers. Most parallel computers are currently progr amm ed in either 
assembly language or a machine-dependent special version of Fortran. The main advantages 
of a general high-level language for parallel computers, such as Parallel Pascal are portability, 
better error detection and diagnosis facilities, and efficiency. Efficiency must be a prime con- 
sideration since with any parallel system extra hardware is being used to achieve a high speed 
performance. Portability is perhaps the most difficult goal to achieve while maintaining 
efficiency since different parallel systems have very different data permutation capabilities. 
The efficiency of working programs can be enhanced, in some cases, by reprogramming a small 
number of critical procedures in assembly code. 

There are three fundamental classes of operations on array data which are frequently 
implemented as primitives on array computers but which are not available in conventional 
progr ammin g languages, these are; data reduction, data permutation and data broadcast. These 
operations have been included as primitives in Parallel Pascal. 

The design of Parallel Pascal was directed towards the MPP; however, it is also suitable 
for other parallel computers with a similar interconnection scheme such as Illiac IV and the 
DAP. Parallel Pascal is also suitable for parallel computers with a less restrictive interconnec- 
tion scheme such as is found with many pipeline systems for example. In this case, it may be 
necessary to implement some additional data mapping functions in order to take advantage of 
all the parallel computers capabilities. A more detailed discussion of the language design is 
given in [lj. An in depth description of Parallel Pascal and other aspects of the MPP is given 
in [2] and also in [3]. 

A Parallel Pascal to standard Pascal translator has been developed to allow initial experi- 
mentation with different language features. This translator is now being used for program 
development of Parallel Pascal progr ams on conventional serial computers. 

A compiler has been developed [3,2] which converts a Parallel Pascal program into a 
parallel p-code form. Interesting features of this p-code language include: a mechanism for 
non-primitive data types needed because of subarrays, an abstract addressing scheme for 
automatically allocated variables to permit the code generator to decide the appropriate host 
for the code, mech anisms for operating on arrays and subarrays, and a symbolic scheme for 
referencing fields of a record structure. .An description of this p-code is given in [4] and also 
in [2]. A parallel p-code code generator for the MPP has been developed by Computer Sciences 
Corporation. 


PARAULEL EXPRESSIONS 

In Parallel Pascal all conventional expressions are extended to array data types. In a 
parallel expression all operations must have conformable array arguments. A scalar is con- 
sidered to be conformable to any type compatible array and is conceptually converted to a 
conformable array with, all elements having the scalar value. For example, given the 
definition 

var a, b, c array [1-10] of integer; 

the following statement 

a := b + c + 1; 

is equivalent to 

for i >» 1 to 10 do 

a£i] > bfi] + cCi] + 1; 

In many highly parallel computers including the MPP there are at least two different 
primary memory systems; one in the host and one in the processor array. Parallel Pascal pro- 
vides the reserved word parallel to allow programmers to specify the memory in which an 
array should reside. In standard Pascal an array type is specified with the following syntax 

type newtype = array [indextype] of eltype; 

where indextype specifies the number and ranges of the array dimensions and eltype specifies 
the type of the array elements. A parallel array type is specified with the syntax 

type newtype - parallel array [indextype] of eltype; 

The parallel specifier exists only to provide information to the the compiler as to the variables 
usage. In all usage in the language a parallel array is indistinguishable from a conventional 
array. In some systems there is no distinction between host and processor memories, then the 
parallel specifier does not have any effect. In any case, a compiler may decline to store the 
array where requested. 


ARRAY SELECTION 

Selection of a portion of an array by selecting either a single index value or all index 
values for each dimension is frequently used in many parallel algorithms; e.g, to select the ith 
row of a matrix which is a vector. Specification of a single index value is the standard index- 
ing method in standard Pascal. In Parallel Pascal all index values can be specified by eliding 
the index value for that dimension. For example, given the definition 

var a,b: array [1-5,1-10] of integer; 

in Parallel Pascal the statement 

a£,l] > b(,4i 

assigns the fourth column of b to the first col umn of a. The following are valid statements in 
standard Pascal 


a :=■ b; 
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a[l] :=■ b(2i 

The second statement means assign the second row of b to the first row of a; in Parallel Pascal 
this could also be specified by 

a[l,] :=* b(2,i 


SUBRANGE CONSTANTS 

It is sometimes necessary to move data between arrays with different dimensions. In 
Parallel Pascal subarrays consisting of consecutive sets of elements may be specified. If subar- 
rays with other than consecutive elements axe required then they must be packed into the 
consecutive form with permutation functions. The concept of a constant subrange is intro- 
duced in order to specify a consecutive subset of index values. 

The syntax for the constant subrange is 
const identifier =■ low-high; 

where low and high are either literals or previously defined constant identifiers. 


SUBRANGE INDEXING AND ARRAY PACKING 

Subrange constants may be used to index an array in Parallel Pascal The general syntax 
for a subrange index is 

array-identifierf offset @ subrange-constant ] 

where offset is an optional conventional scalar index expression. The ordered set of indices 
specified by a subrange index is the result of adding the value of the offset expression to the 
values implied by the subrange constant. For example, given the definition 

var a, b: array [l-10] of integer; 
the statement 

a[@2-6] :« b(3 @ l_5i 
is functionally equivalent to 

for i :=» 1 to 5 do 
a£i + 1] := b(i + 3£ 

The main reason for introducing subrange indexing was to permit blocks of data to be 
transferred between arrays having different dimensions. It was not designed to be a tool for 
algorithm development. 


ARRAY CONFORMABELITY 

In standard Pascal, data items combined together in an expression must be type compati- 
ble. In Parallel Pascal, array data items in a parallel expression must also be conformable, Le. 
have the same rank (number of dimensions) and the same range in each dimension. For exam- 
ple, given the definitions 
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var a, b: array [1-10] of integer: 
c array [0-9] of integer: 

the statement 

a :=» a + b; 

is conformable, -while the statement 
a :=* b + c; 

is not conformable since the specified ranges of b and c are different. 

While the exact range conformability requirement is in keeping with the strong typing 
concepts of standard Pascal, there are occasions when the action specified by the above state- 
ment is useful. The range requirement can be explicitly circumvented by using subrange 
indexing. For example, the statements 

a := b + cf@0_9l 
a[@l-10] b(@l-10] + c; 
a[@l-10] bt@l-10] + cf@CL9i 

are all conformable and have the same effect. 


REDUCTION FUNCTIONS 

Array reduction operations are achieved with a set of standard functions in Parallel Pas- 
cal which are listed in table 1. 


Table 1: Reduction Functions 


Svntax 

Meanine 

sum(array, Dl, D2, Dn) 
prod(array, Dl, D2, Dn) 
all(array, Dl, D2, -- Dn) 
aay(array, Dl, D2, «, Dn) 
max(array, Dl, D2, _, Dn) 
minCarray, Dl, D2, Dn) 

reduce array with arithmetic sum 
reduce array with arithmetic product 
reduce array with Boolean AND 
reduce array with Boolean OR 
reduce array with arithmetic maximum 
reduce array with arithmetic minimum 


The first argument of a reduction function specifies the array to be reduced and the fol- 
lowing arguments specify which dimensions are to be reduced. A dimension is specified by a 
constant expression; the first dim ension is dim ension l. The dimension parameters must be 
constant expressions so that the shape of the result is known at compile time. 

For example, given the the definitions 
var 

a: array[l_10,l-5] of integer: 
b: arrayfl-lO] of integer; 
a integer; 

the following are correct Parallel Pascal statements 
b :=» sum(a, 2); (* sum the rows of a *) 
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c := sum(a, 1, 2); (* sum all elements of the array a *) 
c > max(b, l); (* find the ma-rimum value of b *) 

Each dimensi on parameter of a reduction function implies that there will be one less dimen- 
sion in the result array; a sc alar is considered to be an array without any dimensions in this 
context. 


PERMUTATION AND DISTRIBUTION FUNCTIONS 

One of the most important features of a parallel programming language is the facility to 
specify parallel array data permutation and distribution operations. In Parallel Pascal four 
such operations are available as primitive standard functions; however, for some Parallel Pro- 
cessors it may be necessary to specify more primitive functions for efficiency. The standard 
Parallel Pascal functions for data permutation and distribution are given in table 2. 


Table 2: Permutation and Distribution Functions 


Syntax 

Meaning 

shiftCarray, SI, S2, Sn) 
rotate(array, SI, S2, _, Sn) 
transpose(array, Dl, D2) 
expandCarrav, dim, range) 

end-off shift data within array 
circularly rotate data within array 
transpose two dimensions of array 
expand array along specified dimension 


SHIFT AND ROTATE 

The shif t and rotate primitives are found in many parallel hardware architectures and 
also, in many algorithms. The shift function shifts data by the amount specified for each 
dimension and shifts zeros (null elements) in at the edges of the array. Elements shifted out 
of the array are discarded. The rotate function is similar to the shift function except that 
data shifted out of the array is inserted at the opposite edge so that no data is lost. The first 
argument to the shift and rotate functions is the array to be shifted; then there is an ordered 
set of parameters, each one specifies the amount of shift in its corresponding dimension. There 
must be as man y shift parameters as there are dimensions in the array; the first shift parame- 
ter is associated with the first dimension of the array. 

For example, given the definitions 
var 

a, b: array [ 1-5, 0-9] of integer; 

c, d: array [0-9] of integer; 

the statement 

a := shift(b, 0, 3); 
is functionally equivalent to 

for i := 1 to 5 do 
begin 

for j := 0 to 6 do 

a£i»j] '•= bkj+31 
for j 7 to 9 do 
a£i,j] := 0; 


c 
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end; 

and the statement 

c > rotateCd, 3); 

is functionally equivalent to 

for i ^ 0 to 9 do 
c[i] :=■ d[G + 3) mod 10i 


TRANSPOSE AND EXPAND 

While transpose is not a simple function to implement with many parallel architectures, 
a significant number of matrix algorithms involve this function; therefore, it has been made 
available as a primitive function in Parallel Pascal The first parameter to transpose is the 
array to be transposed and the following two parameters, which are constant expressions, 
specify which dimensions are to be interchanged. If only one dim ension is specified then the 
array is flipped about that dimension. 

The main data distribution function in Parallel Pascal is expand. This function increases 
the rank of an array by one by repeating the contents of the array along a new dim ension. 
The first parameter of expand specifies the array to be expanded, the second parameter, a con- 
stant expression, specifies the number of the new dimension and the last parameter, a subrange 
or a subrange type, specifies the range of the new dimension. 

This function is used to maintain a higher degree of parallelism in a parallel statement; 
this may result in a clearer expression of the operation and a more direct parallel implementa- 
tion. In a conventional serial environment such a function would simply waste space. 

For example, given the definitions of a, b, and c as specified in section 5.1 the following 
statement adds a vector to all rows of a matrix 

a>b + expandCc, 1, 1-5); 

The above statement is functionally equivalent to the following 

for i > 1 to 5 do 
a(ij >.b[ij + c; 


CONDITIONAL EXECUTION 

An important feature of any parallel programming language is the ability to have an 
operation operate on a subset of the elements of an array. In standard Pascal each array ele- 
ment is processed by a specific sequence of statements and there are a variety of program con- 
trol structures for the repeated or selective execution of statments. In Parallel Pascal the 
whole array is processed by a single statement; therefore, an extended program control struc- 
ture is needed. 

The syntax of the Parallel Pascal where statement is as follows: 

where array-expression do 
statement 
otherwise 
statement 
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where array-expression, is a Boolean valued array expression and statement is a Parallel Pascal 
statement. The otherwise and the second controlled statement may be omitted. 

The execution of a where structure is defined as follows. First, the controlling expres- 
sion is evaluated to obtain a Boolean array (mask array). Next, the first controlled statement is 
evaluated. Array assignments are masked according to the boolean control array. If there is 
an otherwise statement it is then evaluated; in this case array assignments are masked with 
the inverse of the control array. 

For example, given the definition 

var a, b, carray [1-10] of integer; 

the following expression 

where a < b do 
c :=> b 
otherwise 
c :=> a; 


is functionally equivalent to 

for i :=■ 1 to 10 do 
if a£i] < b(i] then 

c£i] > bti] 
else 

c{i] > a[it 

The main semantic difference between the where-do-otherwise structure and the if- 
then-else structure is that with the former both controlled statements are evaluated, indepen- 
dent of the value of the control expression, while with the latter only one of the two con- 
trolled statements is evaluated. 

Where statements may be nested provided that all of the controlling array expressions 
are type compatible. Other standard Pascal control statements can also be nested within 
where statements. Any array variable which appears on the left hand side of an assignment 
within a where controlled statement must be type compatible with the controlling array 
expression. Assignments to other t han array variables in a where statement are in no way 
affected by the where statement. The effect of a where statement is local to the procedure or 
function in which it occurs; that is, it does not affect the execution of any procedures or func- 
tions called from within a where statement or an otherwise statement. 


BIT-PLANE INDEXING 

A feature of several current highly parallel computers such as the MPP is that arith- 
metic is conducted at the bit level rather than the word or number level. That is, the com- 
puter "word* or bit plane manipulated by these computers is a single bit slice through all ele- 
ments in the array being processed. 

Some algorithms can be made considerably more efficient for these computers if specified 
at the bit plane level. Bit-plane indexing was added to Parallel Pascal to enable a programmer 
to conveniently specify most of these special algorithms without resorting to an assembly code 
subroutine. 

A bit-plane index is specified by the last item in an index expression and is separated 
from other indices by a colon. The result of a bit-plane indexed array has a Boolean element 
type. For example, given the definition 
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var a: array [1-5,1-10] of integer; 
var b: array [1-5,1-10] of Boolean; 

then the statement 


b :=■ a{.-0i 
is equivalent to 
b :=* odd(a); 

The next example subtracts one from the selected array element if necessary to make it 
exactly divisible by 2. 

a[ 3,1:0] :=■ false; 

The least significant or first bit-plane is always bit-plane 0. Programming with bit-plane 
indexing requires a knowledge of the internal number representation of the parallel processor 
and is a highly non portable feature. Furthermore, bit-plane indexing on a processor which 
does not operate at the bit level is usually very inefficient. 


LIBRARY SUBPROGRAMS AND SEPARATE COMPILATION 

Standard Pascal has no library facility; all subprograms in, procedures and functions, 
must be present in the source program. A library preprocessor was developed to allow the use 
of libraries without violating the rules of standard Pascal The header line of a library sub- 
program is specified in the source program with an extern directive. The library preprocessor 
replaces the extern directive with the appropriate subprogram body. The type information for 
the library subprogram is extracted from the declaration statement in the source program. 
Therefore, library subprograms can be written to work with any user specified array type. 

If a library subprogram is to be used for more than one array type in the same block, 
then a subprogram declaration statement for each unique argument type is necessary. Each 
unique version of the subprogram is identified by a user specified extension to the subprogram 
name in both declaration and usage. 

For example, consider the ceiling function as defined below: 

function ceiliugCxrnype) : rtype; 

begin 

where x < 0.0 do 
ceiling :=» truncCx) 

otherwise 

where x-trunc(x) =» 0.0 do 
ceiling > truncCx) 
otherwise 

ceiling > trunc(x)+l; 

end; 

The following program fragment illustrates how more than one version of this function could 
be specified for the library preprocessor. 


type 

ar =» array [1-10] of real; 
ai = array [1-10] of integer; 
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br => array [l_8, 1-8] of real; 
bi = array [l_S, 1-8] of integer; 
function ceilinga(xnr) ai; extern; 
function ceiling.b(x:br) :bi; extern; 
var 

axax; ayai; bxtbr; by.bi; 
begin 

ay :=* ceilinga(ax); 

by > ceiling.b(bx); 

The simple library preprocessor does not solve the separate compilation problem: all 
requested library subprograms must be recompiled whenever a change is made to the main 
program- However, it is an expedient solution to the library problem which will work, with 
all Parallel Pascal compilers. External, partially compiled sub p r o grams could be inserted at 
the p-code level or at the code generator level of a compiler. However, they should be 
inserted before the optimization stage so that specific parallel computer sensitivities to different 
array sizes may be considered. 


MPP COMPILER RESTRICTIONS 

The Parallel Pascal compiler for the MPP currently has several restrictions. The most 
important of these is that the range of the last two dimensions of a parallel array are con- 
strained to be 128; in, to exactly fit the parallel array size of the MPP. It is possible that 
language support could have been provided to mask the hardware details of the MPP array 
size from the programmer; however, this would be very difficult to do and efficient code gen- 
eration for arbitrary sized arrays could not be guaranteed. Matrices which are smaller than 
128 x 128 can usually be fit into a 128 x 128 array by the progr amm er. Frequently, arrays 
which are larger than 128 x 128 are required and these sure usually fit into arrays which have 
a conceptual size which is a multiple of 128 x 128. 

A large matrix of dimensions (m * 128) x (n * 128) is specified by a four dimensional 
array in the MPP version of Parallel Pascal which has the dimensions m x n x 128 x 128. 
There are two fundamental methods for packing the large matrix data into this four dimen- 
sional array, this packing may be directly achieved by the staging memory in both cases. In 
the "c rinkl ed" packing scheme a m x n matrix of adjacent large matrix elements is assigned to 
each PE; adjacent submatrices are assigned to adjacent PE’s. More formally, element (i,j) of the 
large matrix is mapped to location [i mod m, j mod n, i div m, j div n] of the four dimensional 
array. 

The alternative packing scheme, called "blocked" packing, assigns adjacent large matrix 
elements to adjacent PFs in blocks of 128 x 128. The large matrix is represented by a m x n 
matrix of adjacent 128 x 128 blocks. More formally, element (i,j) of the large matrix is 
mapped to location [i div 128, j div 128, i mod 128, j mod 128] of the four dimensional array. 

The best method of large array packing is application dependent which is one reason that 
large arrays are not handled automatically by the compiler. 

Progr ammin g with large matrices stored as four dimensional arrays is very simple in 
Parallel Pascal. In general, programs developed for a single 128 x 128 array are easily modified 
to deal with large packed matrices. Simple arithmetic expressions directly extend to higher 
dimensioned arrays, reduction functions may require additional dimension specifiers. Shift 
and rotate operations require special consideration since care must be taken to correctly 
transfer data between the boundaries of the submatrices. Generic library functions have been 
written in Parallel Pascal to deal with large matrices. The functions lshif t and lrotate will 
correctly manipulate large matrices stored with the blocked packing scheme and the functions 
crshift and crrotate will deal with matrices stored with the crinkled packing scheme. 
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The hardware organization, of the MPP currently imposes some further language restric- 
tions. These could be removed with a more advanced version of the code generator. Host pro- 
grams for the MPP can be run either on the main control unit (MCU) or on the VAX; in the 
latter case the MCU simply relays commands from the VAX to the PE array. The advantages 
of running on the VAX is a good programming environment, floating point arithmetic support 
and large memory (or virtual memory). The advantage of running on the MCU is more 
direct control of the MPP array. 

Compiler directives are used to specify if the generated code should run on the MCU or 
the VAX. With the current implementation of the code generator, only complete procedures 
can be assigned to the MCU and only programs on the MCU can manipulate parallel arrays. 
There are several other language restrictions for programs which are run on the MCU such as 
no conventional I/O. Therefore, the programmer must isolate sections of code which deal with 
the PE array in procedures which are directed to the MCU. A better strategy mig ht be to run 
the majority of the host program code on the VAX with only small sections which deal with 
PE array on the MCU, then there would be no language restrictions for the progr amm er but 
the code generator would be more complex. 


LIBRARY PROGRAM DEVELOPMENT 

Initial experience with developing algorithms for the MPP indicate that library func- 
tions must frequently be developed in three different forms for maximum efficiency. The 
basic form is a pure Parallel Pascal algori thm which takes the greatest advantage of the paral- 
lel features of the language. This form will run directly on the MPP array for arrays with 
the last two dimensions being 128 x 128. 

The second form is for large arrays on the MPP, La, arrays with dimensions which are 
exact multiples of 128 x 128. In many cases, such as the near neighbor operations described in 
the next section, the transformation to this form from the previous form is very simple. In 
general, shift and rotate functions are replaced by lshift and lrotate library functions. 

The third form is for functions which are to be implemented on the VAX host com- 
puter. While parallel algorithms will run correctly on the host computer such algorithms do 
not take advantage of the direct indexing capabilities of the VAX and much more efficient 
serial algorithms may be possible. Furthermore, any algorithms which use the bit-indexing 
language features can usually be much more efficiently reprogrammed for the VAX since it 
does not have an efficient bit indexing mechanism. 


LIBRARY PACKAGES 

The Parallel Pascal language provides basic efficient orthogonal primitives for developing 
application programs. However, it is expected that for any specific application area application 
directed primitive library functions will be required. Several application library packages 
have already been developed for Parallel PascaL These include; large matrix shift and rotate, 
near neighbor functions, a general permutation function which is used for matrix rotation and 
polynomial warping, a parallel random number generator, and pyramid data structure func- 
tions. 

In the next two sections, programming techniques will be illustrated with examples 
from the near neighbor package and the permutation package. 


THE NEAR NEIGHBOR LIBRARY PACKAGE 

The near neighbor package illustrates how Parallel Pascal can provide a convenient 
environment for specifying application primitives; the need for different forms of the same 
function is also demonstrated. A near neighbor (nn) operation is one in which each result ele- 
ment of a matrix is computed by a function of only locally adjacent elements in a 
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corresponding input matrix. Near neighbor operations are frequently used in image processing 
applications. Several high, level languages for image processing include such, operations as basic 
primitives. 

The basic unit frequently used in near neighbor operations is a small matrix (3 x 3 to 7 x 
7) of constant values. The first nn library function, called mx3, is used for specifying 3x3 
matrices. The use of this function is illustrated in the following example. 

type 

mtype - array [1-3, 1-3] of integer; 
function mx3(v00, vOl, v02, 
vlO, vll, vl2, 

v20, v21, v22: integer): mtype; extern; 
var me mtype; 

me > mx3( -1, -1, -1, 

- 1 , 8 ,- 1 , 

- 1 , - 1 , - 1 >, 


The matrix me is set with all boundary elements to -1 and the center element to 8 as is pic- 
torially shown. A typical filtering operation is to convolve a large image matrix with a small 
kernel matrix; the generic library function conv is designed for this operation- A possible 
de fini tion for conv is shown below: 

type • • • 

mx = parallel array [l-128, 1-128] of eltype; 
function conv(matrixrmx; kemehmtype>mx; 
var i,_p integer; 

sum: mx; 
begin 
sumX); 

for 1 to 3 do begin 

for j> 1 to 3 do begin 
if (kernelthj] O 0) then 
sum > sum + kemel[i,j] * shif t(matrixl-2, j-2); 
end; 

end; 

conv sum 
end; 
var . . . 

ma, mbanx; 


the convolution of ma with the kernel me is specified by 
mb :=» convCma, me); 

The contents of the kernel may also be expressed in the same statement; e.g- 

mb := convOna, mx3(0, 1, 0, 

0 , 0 , 1 , 

0 , 1 , 0 >, 

Sparse kernels with only a few 1 elements occur frequently in some applications. In these 
cases programmers often prefer to specify the kernel in a short-hand form consisting of a list 
of the cardinal directions of the l’s. The library function mxd converts such a list to the 3 x 
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3 matrix; using this function the above statement may be rewritten as 
mb :=* convfma, mxddN, E, S])); 

The large matrix version of conv is simply specified by changing the word shift to Ishift 
of crshift, as is appropriate, and redefining mx to be a four dimensional array. Conv is also 
reasonably organized for a serial processor. An improvement may be possible in this case by 
explicitly writing the loops so that the convolution is computed in one pass through the data 
since better use of a cache memory would result. 


THE PERMUTATION LIBRARY PACKAGE 

The need for more than one version of a library function for the same operation is illus- 
trated with the permutation function. The matrix permutation function has three arguments: 
a data matrix to be permuted, a row matrix which indicates in which row from the data 
matrix the corresponding result element is to be obtained and a col umn matrix which indi- 
cates in which column the result is to be obtained. The function returns the permuted matrix 
On fact any data mapping is possible). A serial version of this function is shown below: 

type 

pa *■ array [lolJiil, lo2_hi2] of integer, 

function penn2s(mxqja; npa; cpa)ma; 

var 

i, j: integer; 

begin 

for i .■=• lol to hil do 
for j :=» lo2 to hi2 do 

penn2s(i,j] :=* mxfr{i,j],c[i,jl]; 

end; 

This function is efficiently programmed for a serial computer such as the VAX host and 
would execute in 0(n ) time for a n x n data matrix. In contrast, this would be a very poor 
algorithm for a parallel array. A single data transfer would require (Xn) time since only near 
neighbor shifts are possible; therefore, the total algorithm would require (Xu ) time. 

We have developed a parallel algorithm for this task which .attempts to move all the 
data together as much as possible. On anarallel processor with (Xn ) PFs this algorithm still 
has a worst case time complexity of (Xn'V but for many structured permutations such as rota- 
tion and warping the complexity will be closer to (Xn). 

The parallel algorithm will execute much more slowly on a serial processor than the 
serial algorithm since the serial algorithm makes direct use of the serial processors indexing 
capability. The parallel algorithm can be easily extended to large matrices which are multi- 
ples of 128 x 128 by directly replacing shift operations with Ishift operations; however, this 
will not be optimal with respect to the number of shift operations needed. For example, with 
the blocked storage scheme a horizontal shift which is a multiple of 128 steps requires no shift 
operations at all since the data is already in the correct PE. A large matrix version of the 
algorithm is currently under development which will attempt to minimize the actual number 
of shift operations executed. 


EXTENDED I/O 

For many applications the standard I/O facilities of Pascal will be adequate. The staging 
buffer of the MPP can be very useful for directly performing cer tain data permutations; these 
permutations cannot be directly specified in the basic language of Parallel Pascal. A high level 
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language facility for using the staging buffer has not yet been implemented; a proposal for 
how the staging buffer may be programmed and used is outlined below. 

There are three new capabilities, made possible with the staging buffer, which we would 
like to specif y in the language. 

1. File reformatting 

Raw data files may not be in the correct format for the MPP array. For example we 
may wish to do large matrix packing or select one band of a multiband file. 

2. Sub-array file I/O 

In some cases it is useful to assemble a large array from sub arrays (which may be 
smaller than 128 x 128). 

3. Data permutations 

The staging buffer is capable of implementing a large range of data permutations. In 
some cases it is effective to transfer data from the PE array through the staging buffer 
and back, to the array in order to achieve a data permutation. 

These new capabilities could be mad* available by introducing two new procedures to 
Parallel Pascal and relaxing one of the Pascal I/O constraints as indicated below; 


FILE REFORMATTING 

File reformatting can be achieved by the "reformat" procedure which specifies a reorder- 
ing of the dimensions of an array structure. For example consider that we have a disk file of 
a set of 128 x 128 images with 6 bands. The file is declared as follows: 

var f: file of parallel array [l_6, 1-128, 1-128] of 0-255; 

This will work correctly if each image is stored on the disk as a sequence of 6 consecutive 128 
x 128 matrices. If the data is stored in pixel interleaved format (Le. as a sequence of 6 ele- 
ment pixels) then the staging buffer must be set to do the format conversion; this can be 
achieved with the following call to the reformat procedure: 

reformatCf, 2, 3, l); 

This specifies that the ordering of the dim ensions for the data on the disk is 2, 3, 1; Le, the 
disk file array has the shape 128 x 128 x 6. 


SUB-ARRAY I/O 

In general, array I/O is done with a file having records of type array; an I/O operation 
then specifies the transfer of a complete array. Sub-array I/O may be achieved by relaxing 
the Pascal restriction that complete file data types must be read or written. The range of the 
dimensions of the subarray must match the last dim ensions of the file array. For example, 
consider the 6 band image file described above and the following array declarations, 
type ar = parallel arraytl-128, 1-128] of 0-255; 
var a: ar; 

b: array [l_6] of ar; 

With the conventional Pascal I/O restrictions only whole images can 
be read; Le^ readCf, b) is a valid statement whereas read(f, a) 
is not since a is not the same type as the file type. 

In the extended I/O scheme, readCf, a) is permitted and reads the bands of an image one 
at a time. Parallel array files having the last dimensions smaller than 128 x 128 may be 
declared in which case a variable in an I/O statement such as a must have a subrange index 
for conformability. 
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DATA PERMUTATIONS 

Data permutations can be achieved with, a "link" procedure which, links two files. 
Linked files form a virtual channel through the staging buffer and, in general, do not require 
any disk space. The data permutation is specified by reformatting the two files. 

The syntax for link, is 
linkCf.g); 

where f and g are files. 

T.ink also implies a "rewrite" on file f and a "reset" on file g. 

For example: a transpose permutation could be specified as follows: 
type 

at => parallel array [1-128, 1-128] of real; 
var 

fa, fb: file of at; 
a, b: at; 

reformat(fa, 2, l); 
linkCfa, fb); 
write(fa, a.}, 
read(fb, b); 

The above set of statements is equivalent to 
b := transposeCa, 2, l); 


CONCLUSION 

A version of the Pascal programming language for parallel computers has been developed 
which required very few new language features. One of the main features of this language is 
that permutations are achieved with conventional function forms. In this way it is simple to 
introduce new permutation functions for the efficient programming of a new parallel com- 
puter, when necessary, without changing the language. 

No attempt was made with the first implementation of the MPP compiler to hide the 
128 x 128 dim ensions of the PE array. This was considered to be necessary in order to ensure 
that efficient algorithms are developed and also ensure that the very limited local memory is 
not squandered. Progr ammin g tools have been outlined for programming larger arrays. A 
future compiler may hide these details from the user as effective programming techniques are 
better understood. 

The only extensions needed for Parallel Pascal to effectively use the MPP hardware are 
the I/O extensions which co nsis t of two new procedures and the relaxation of a standard Pas- 
cal constraint. These enable the staging buffer to be effectively utilized for data permutations 
and file reformatting. 

Parallel Pascal provides convenient orthogonal efficient high level primitives on which to 
build application p rogr a ms. It is more difficult to efficiently program a parallel computer than 
a serial computer; therefore, the estab lishm ent of library packages for application oriented 
primitives is even more important than for the conventional serial case. Some of the program- 
ming techniques for Parallel Pascal have been outlined in the context of packages which are 
currently being developed. 
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xshift - constrained shift function 
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Abstract 

The Parallel Pascal Development System enables Parallel Pascal programs to be 
developed and tested on a conventional computer. It consists of several system programs, 
including a Parallel Pascal to standard Pascal translator, and a library of Parallel Pascal sub- 
programs. The library includes subprograms for using Parallel Pascal on a parallel system 
with a fixed degree of parallelism, such as the Massively Parallel Processor, to conveniently 
m a ni pulate arrays which have different dimensions than the hardware. Programs can be con- 
veniently tested with small sized arrays on the conventional computer before attempting to 
run on a parallel system. 


Introduction 

This manual is organized in two sections: the first describes the system programs and the 
second describes available library subprograms. The library documentation is organized alpha- 
betically on the name of the library function; when more than one function is described, the 
first listed name is used. A subject based index, which groups the library subprogram names 
under logically ordered application headings, is given at the end of the manual. 

In general, a user of this system need be aware of only one program, called pp, which is 
used to compile and link Parallel Pascal programs on a conventional computer system. In 
order to use this system, a user should read the documentation on pp, which is in section 1 of 
this manual, and also be familiar with the Parallel Pascal programming language, which is a 
superset of the Pascal programming language. The Parallel Pascal extensions to Pascal are 
described in p pascal which is also in section 1 of this manual; a more detailed discussion of 
the language is given in [l]. An in depth description of Parallel Pascal and other aspects of 
the MPP is given in [2] and also in [3J. For information on standard Pascal, the initial Pascal 
language as designed by Wirth is described in [4] since then the ISO standard Pascal has been 
specified, a very readable account of which is given by Cooper [5l and an IEEE/ ANSI standard 
has also been specified [6]. 

Library Subprograms 

A set of library subprograms, written in Parallel Pascal, is available with this system. 
Documentation for these subprograms is given in section 2 of this manual. These subprograms 
will be automatically inserted into a users program if the correct library file comment 
specifier is given. See the detailed documentation for each library function to obtain this 
specifier. User library subprograms may also be easily used; see the documentation on extern 
in section 1 of this manual for the details of library program formats. 

On a system with a fixed degree of parallelism, such as the Massively parallel Processor 
(MPP), the last dimensions of parallel arrays are fixed by the hardware (the last two dimen- 
sions must be 128 x 128 on the MPP). Arrays for such systems are categorized into four types: 
regular, which exactly match the hardware; large, which have dimensions which are an exact 
multiple of the hardware dimensions; small, which have dimensions smaller than the 
hardware dimensions; and huge, which are too large to fit into fast primary memory and must 
be accessed in regular sized blocks. 



Regular arrays require no special treatment. Large arrays are organized as four dimen- 
sional structures; a set of library subprograms makes the programming of these arrays very 
similar to regular arrays. Progr ammin g for small arrays is usually very simple and applica- 
tion dependent and is currently left to the progra mm er. Huge arrays are very difficult to deal 
with in a general sense; some utilities are available for accessing arbitrarily located chunks 
from large arrays which should be useful for these applications. In some cases different algo- 
rithms should be used depending on where the p rogram is to run; La, on the host computer or 
the parallel hardware, in some cases versions of the library functions to run on a serial host 
computer are also available. 


Implementation 

This development system involves three programs pp, ppt and extern, Pp is a co mm and 
file which controls the compilation process; versions of pp are available for VAX- VMS and 
UNIX. It first invokes extern to insert the bodies (in Parallel Pascal source code) of any 
library subprograms and then invokes ppt to translate the complete Parallel Pascal program 
into standard PascaL If no errors are detected by ppt then the translated program is compiled 
and linked with the host computers standard Pascal compiler and linker. 

Ppt is a Parallel Pascal to Pascal translator, it is written in PascaL It will only translate 
a subset of the Parallel Pascal language; see the documentation on ppt for restrictions. 

Extern is a library preprocessor; it is written in C. A problem with the standard Pascal 
language is that there is no subprogram library facility. Library subprograms are referenced 
in a user program by a formal declaration statement and an extern directive. Additional infor- 
mation may be passed with the extern statement to allow, for example, a library subprogram 
to be used with different sized arrays. Library subprograms are specified as a Parallel Pascal 
body with a dummy declaration statement. Users can easily develop their own library sub- 
programs. 

This development system has been implemented on VAX-11/780 computers for both 
VMS and UNIX operating systems. It should work on any UNIX system which has a large 
enough memory and a Pascal compiler. Ppt has also been run on a Perkin Elmer computer. For 
a different computer system a Pascal compiler is necessary to compile ppt and to run the code 
it generates, pp will have to be rewritten for the host operating system and, for the automatic 
insertion of library subprograms, a C compiler will be necessary to compile extern. 


References 

1. A- P. Reeves, “Parallel Pascal: An extended Pascal for Parallel computers,” Journal of 
Parallel, and Distributed Computing 1 pp. 64-80 (1984). 

2. A. P. Reeves and J. D. Bruner, “The Language Parallel Pascal and Other Aspects of the 
Massively Parallel Processor,” Cornell University Technical Report (December 1982). 

3. J. D. Bruner, “Efficient Implementation of a High -level Language on a Bit-Serial Parallel 
Matrix Processor,” PhD. Thesis, Purdue University (1982). 

4. K. Jensen and N. Wirth, PascaL User Manual and Report, Springer-Verlag (1976). 

5. D. Cooper, Standard PascaL Users Reference Manual, W. W. Norton (1983). 

6. , American National Standard Pascal Computer Programming Language, IEEE (1983). 



EXTERN(l) PPS-PDS Users Manual EXTERN ( 1 ) 


NAME 

extern — pascal external subprogram preprocessor 
SYNOPSIS 

extern [-s] libl lib2 libn < infile >outfile 
DESCRIPTION 

Extern, reads a source program on tlie standard input and replaces external function refer- 
ences •with the source code from library files. Library subprograms are specified in the source 
program by a function or procedure statement followed by an extern directive. Up to 100 
string constants (parameters) may be substituted for in the source code for each 
function/procedure. The modified source program is written to standard output. 

The external function source code is taken from library files which may be specified in the 
command line for extern as shown in the synopsis or in library commands in the program 
text. A program library command has the following format: 

{$library libname } 
or 

(*$library libname *) 

where libname is the name of a library file. The program library commands are in the form 
of Pascal co mm ents so they we have no effect on a subsequent Pascal or Parallel Pascal com- 
piler. 

The library file names usually have a .pi extension- Extern first checks if a specified library 
file name (or path name) is in the current directory. If no file is found, it then checks for the 
same name with a .pi extension- It then checks in the system library directory first with the 
file name as specified and then with a .pi extension. A library not found error is reported if 
the file has still not been found. 

Extern is designed to be used with any Pascal compiler; however, it has a special listing 
feature when used with the Parallel Pascal Translator ppt(l). By default, it inserts ppt flags 
(in comment statements) into the output file which prevents the inserted subprogram bodies 
from appearing in the listing file. In this way, the listing file corresponds to the original 
source file in both content and line numbers; any errors detected in the library functions will 
still be reported. If the -s option is specified then all of the output file will appear in the sub- 
sequent listing file. 

Extern is rather primitive and does not understand much about Pascal syntax- External func- 
tion and procedure declarations must start on a new line and the the declaration and extern 
directive, with any required additional parameters, should appear consecutively without any 
comment statements. (Co mm ent statements bounded by ’{ }’ which do not contain any semi- 
colon symbols ’? should be oJcJ. The key words ’function', ’procedure’, ’extern' and 'library' 
are only recognized if typed in lower case letters. 

MULTIPLE INSTANCIES OF A FUNCTION 

There are many cases when we may wish to have the same subprogram defined to operate on 
arguments having different types in the same program block. Extern has a facility for 
defining multiple versions of a library subprogram- The function name in the source program 
can be followed by a dot (.) and a single character version identifier, e.g. function conv3 .0, 
conv3.1. All instances of the subprogram usage in the source program body must have the 
correct version identifier. 

LIBRARY FILE FORMAT 

Each function in a library file is delimited by a pound sign (#), as follows: 

#funcl 
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{local vars, body of fund, including arbitrary references to 
the variables SO through $99 as needed} 

#func2 

{local vars, body of funcl, including arbitrary references to 
the variables SO through S99 as needed} 

?fend 

The library function specification is similar to a pascal subprogram without the procedure or 
function statement as this is provided by the source program. 

The S variables are used to specify types which are taken from the declaration of the function 
in the source program. Types or constants which do not appear in the argument list may be 
specified after the extern statement. The following example illustrates how $ variables are 
obtained from the function definition in the source file. 

function conv3Cvarl.typel,var2.type2)retumtype; extern type3, type4,type5; 

$0 is always the return type of the function. In this case, 

$l=typel, $2=*type2, S 3= type 3, $4~type4, $5=type5. 

Extern will also preprocess external procedure definitions; in this case, SO is not defined. 
AUTHORS 

Steve Elias, Anthony P. Reeves 

BUGS 

key words are only recognized in lower case letters. 

external functions and procedures must each be declared starting on a new line, 
external function and procedure declarations should not contain comments. 
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NAME 

pp — Parallel Pascal translator and compiler 
SYNOPSIS 

pp name [-i] [-s] [libl lib2 _] 

DESCRIPTION 

Pp is the Parallel Pascal compiler. If given a file name ending in .pp it will compile the file 
and load it into an executable file having the .same name as the original file without the .pp 
extension. 

If the -i option is specified then the Pascal translator pi is used; the interpreter code is loaded 
in to the file obj for interpretation by px. For program development with small data sets the 
-i option is strongly reco mm ended as it is usually much faster. 

If any file names without a .pp extension are specified then these are assumed to be library 
files. Relevent subprograms will be extracted from these files by the extem(l) library proces- 
sor. 

If the -s option is specified then the complete library functions will be written to the listing 
file "pplist". In this case the listing line numbers will not correspond to the source program 
line numbers. 

Pp uses the Parallel Pascal translator ppt to convert the named file into a conventional Pascal 
program which is stored in a file having the same name as the input file but with a .p instead 
of the .pp extension. A listing of the translated program including any error messages is 
stored in pplist. There are several Parallel Pascal language restrictions with the translator; see 
ppt(l) for details. 

The translated program is compiled with pc by default and by pi if the -i option is specified. 
If any errors are detected by ppt no further compilation takes place. 

FILES 


file.pp 

input file 

file.p 

translated Pascal file 

file 

output file from pc 

obj 

output file from pi 

pplist 

Parallel Pascal listing 


SEE ALSO 

ppt(l), pc(l), pi(l), pxCl), extem(l) 
AUTHOR 

Anthony P. Reeves 
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NAME 

ppt — Parallel Pascal translator 
SYNOPSIS 

ppt < infile >outfQe 
DESCRIPTION 

Ppt is a Parallel Pascal translator which, translates a Parallel Pascal program into a conven- 
tional Pascal program- Ppt is itself written in Pascal (based on the P4 compiler) for portabil- 
ity. 

Ppt reads a Parallel Pascal source program from the standard input and writes the translated 
program onto the standard output. The listing of the source program, including any error 
messages, is stored in a file called pplist. Any errors are also reported to the ter mina l. Ppt 
only translates a subset of the Parallel Pascal language; see the bugs section for details of the 
language restrictions. WARNING - The first pass of ppt checks for a valid Parallel Pascal pro- 
gram but does not check, for all the restrictions and limitations of the second pass. 

Translator options are written as comments intermixed throughout the source program. Com- 
ments are designated as option comments by a S-character as the first character of the com- 
ment as follows: 

(*$ Option sequence > <any comment > *) 

Example: (»$l+m- *) 

The option sequence uses commas to separate options. Each option consists of a letter indicat- 
ing the option desired and either a plus (+) or minus (-) to indicate whether to activate or 
deactivate the option. 

The following options are presently available: 

1 - Create a listing of the source program as the syntax analysis is being conducted. Default 
is 1+. 

m - Map upper case letters to lower case letters. (All reserved words and standard functions 
or procedures are only recognized in lower case.) Default is m-. 

c - Output both the translated code and the original code to the standard output. (The origi- 
nal code is written out within comments to not affect the actual program.) Default is c-. 

u-T-rr - 

Limit the line length of the translated output to xxx characters where possible. The 
range of acceptable values is from 80 to 132 characters: Default is ul32» 

The options take effect directly after being issued and therefore they may be selectively 
applied to the program sections. 

Throughout the design of the translator emphasis was put on the production of reliable and 
predictable translations and not on the production of efficient translations. There are some 
necessary optimizations which were implemented that greatly reduce the amount of both data 
space and instruction space used. These included temporary variable reusage and type match- 
ing for Parallel Pascal functions. 

An additional feature of ppt is the ability to declare a function or procedure as being external 
to the program where it is called. The syntax for an external declared function or procedure 
is shown telow: 

procedure identifier ( parameter list \ extern; 
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- or - 


function identifier ( parameter list y. result t y pe; extern; 

FILES 

infile Parallel Pascal inputfile 

outfile Pascal output file 

pplist file containing listing 

DIAGNOSTICS 

A detected error is indicated by placing an arrow followed by a number (which corresponds 
to the appropriate error message) below the source line where the error was detected. For 
each error number issued throughout the source pr o g r a m the corresponding error message is 
printed out at the end of the listing. Error messages for Standard Pascal were obtained from 
the Pascal User Manual and Report by Jensen and Wirth, the error messages added for the 
Parallel Pascal features are listed below. 


350: must be parallel array type 

351: illegal type for parallel array 

360: parallel arrays not compatible 

361: parallel array not compatible with controlling array 

362: result must be parallel type 

363: parallel array not allowed 

364: function result type must be parallel array 

365: dimension not compatible with array 

366: integer constant expected 

367: at least one dimension expected 

368: bit index type must be integer 

369: error in number of standard function arguments 

370: subrange exceeds array index limits 

371: set type not compatible with array index type 

372: index type must be scalar, subrange or set 

373: bit indexing not allowed 

374: illegal array type for bit indexing 


AUTHORS 

Anthony P. Reeves, Tony M. Brewer, Steve Elias, Mike Vemick 
RESTRICTIONS 

1. Constant subranges are not implemented 

2. Bit indexing is only implemented for types integer and subrange. 

3. Parallel arrays cannot have more than nine dimensions. 

4 The index type of a parallel array must be integer. 

5 All parallel identifiers declared by ppt have "pll" as a prefix to the identifier name. 
Therefore a user identifier should not begin with this prefix. Also, standard parallel 
functions are translated to have the same nam e but ending with a numeral; therfore 
names of this form sould be avoided. 

6 Input and output: Read, readln, write and writeln may have at most one parallel param- 
eter which must be the first parameter of the I/O procedure except for possibly a file 
specifier. If readln or writeln is used, then each element of the array would be read or 
written using readln or writeln respectively not just the last element. All other 
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introduction 

Parallel Pascal is an extended version of Pascal for programming parallel computers. This 
summary briefly describes the extensions which have been made to PascaL 

PARALLEL EXPRESSIONS 

In Parallel Pascal all conventional expressions are extended to array data types. In a 
parallel expression all operations must have conformable array arguments. A scalar is con- 
sidered to be conformable to any type compatible array and is conceptually converted to a 
conformable array with all elements having the scalar value. For example, given the 
definition 

var a, b, c array [1.10] of integer; 

the following statement 

a :=> b + c + 1; 

is equivalent to 

for i .*=» 1 to 10 do 

a[i] b[i] + cfi] + 1; 

In many highly parallel computers including the MPP there are at least two different 
primary memory systems; one in the host and one in the processor array. Parallel Pascal pro- 
vides the reserved word parallel to allow programmers to specify the memory in which an 
array should reside. In standard Pascal an array type is specified with the following syntax 

type newtype =» array [indextype] of eltype; 

where indextype specifies the number and ranges of the array dimensions and eltype specifies 
the type of the array elements. A parallel array type is specified with the syntax 

type newtype - parallel array [indextype] of eltype; 

The parallel specifier exists only to provide information to the the compiler as to the variables 
usage. In all usage in the language a parallel array is indistinguishable from a conventional 
array. In some systems there is no distinction between host and processor memories, then the 
parallel specifier does not have any effect. In any case, a compiler may decline to store the 
array where requested. 


ARRAY SELECTION 

Selection of a portion of an array by selecting either a single index value or all index 
values for each dimension is frequently used in many parallel algorit hms; e.g, to select the ith 
row of a matrix which is a vector. Specification of a single index value is the standard 
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indexing method in standard Pascal. In Parallel Pascal all index values can be specified by 
eliding the index value for that dimension. For example, given the definition 

var a,b: array [1-5,1-10] of integer; 

in Parallel Pascal the statement 

a£,l] >■ b(,4i 

assigns the fourth col umn of b to the first column of a. The following are valid statements in 
standard Pascal 

a >■ b; 

all] := b[2l 

The second statement means assign the second row of b to the first row of a; in Parallel Pascal 
this could also be specified by 

allj » b[2,t 


SUBRANGE CONSTANTS 

It is sometimes necessary to move data between arrays with different dimensions. In 
Parallel Pascal subarrays consisting of consecutive sets of elements may be specified. If subar- 
rays with other than consecutive elements are required then they must be packed into the 
consecutive form with permutation functions. The concept of a constant subrange is intro- 
duced in order to specify a consecutive subset of index values. 

The syntax for the constant subrange is 
const identifier =■ low-high; 

where low and high are either literals or previously defined constant identifiers. 


SUBRANGE INDEXING AND ARRAY PACKING 

Subrange constants may be used to index an array in Parallel Pascal. The general syntax 
for a subrange index is 

array-identifierf offset @ subrange-constant ] 

where offset is an optional conventional scalar index expression. The ordered set of indices 
specified by a subrange index is the result of adding the value of the offset expression to the 
values implied by the subrange constant. For example, given the definition 

var a, b: array [1-10] of integer; 

the statement 

a[@2 -6] > b{3 @ l-5t 

is functionally equivalent to 

for i :=> 1 to 5 do 
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afi + l] := bCi + 31 

The main reason, for introducing subrange indexing was to permit blocks of data to be 
transferred between arrays having different dimensions. It was not designed to be a tool for 
algori thm development. 


ARRAY CONFORMA BHiTY 

In standard Pascal, data items combined together in an expression must be type compati- 
ble. In Parallel Pascal, array data items in a parallel expression must also be conformable, Le. 
have the same r ank (number of dimensions) and the same range in each dim ension. For exam- 
ple, given the definitions 

var a, bt array [l— 10] of integer; 
c array [0-9] of integer; 

the statement 

a > a + b; 

is conformable, while the statement 
a .•« b + c; 

is not conformable since the specified ranges of b and c are different. 

While the exact range confonnability requirement is in keeping with the strong typing 
concepts of standard Pascal, there are occasions when the action specified by the above state- 
ment is useful. The range requirement can be explicitly circumvented by using subrange 
indexing. For example, the statements 

a :=* b + c{@0_9i 
a{@l-10] » bt<§l-10] + c; 
a(@U10] > b(@l-10] + c£@0_9i 

are all conformable and have the same effect. 


REDUCTION FUNCTIONS 

Array reduction operations are achieved with a set of standard functions in Parallel Pas- 
cal which are listed in table 1. 


Table 1: Reduction Functions 


Syntax 

Meaning 

sumCarray, Dl, D2, Dn) 
prodCarray, Dl, D2, Dn) 
all(array, Dl, D2, Dn) 
any(array, Dl, D2, _, Dn) 
max(array, Dl, D2, _, Dn) 
min(array, Dl, D2, _ Dn) 

reduce array with arithmetic sum 
reduce array with arithmetic product 
reduce array with Boolean AND 
reduce array with Boolean OR 
reduce array with arithmetic maximum 
reduce array with arithmetic minimum 
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The first argument of a reduction function specifies the array to be reduced and the fol- 
lowing arguments specify which dimensions are to be reduced- A dimension is specified by a 
constant expression; the first dim ension is dimension 1 . The dim ension parameters must be 
constant expressions so that the shape of the result is known at compile time. 

For example, given the the definitions 
var 

a: array{l-10,l-53 of integer; 
b: array{l_10] of integer; 
c integer; 

the following are correct Parallel Pascal statements 

b sum(a, 2>, (* sum the rows of a *) 

c sum(a, 1, 2); (* sum all elements of the array a *) 
c > max(b, l); (* find the maximum value of b *) 

Each dimension parameter of a reduction function implies that there will be one less dimen- 
sion in the result array; a scalar is considered to be an array without any dimensions in this 
context. 


PERMUTATION AND DISTRIBUTION FUNCTIONS 

One of the most important features of a parallel programming language is the facility to 
specify parallel array data permutation and distribution operations. In Parallel Pascal four 
such operations are available as primitive standard functions; however, for some Parallel Pro- 
cessors it may be necessary to specify more primitive functions for efficiency. The standard 
Parallel Pascal functions for data permutation and distribution are given in table 2. 


Table 2: Permutation and Distribution Functions 


Syntax 

Meaning 




SHIFT AND ROTATE 

The shift and rotate primitives are found in many parallel hardware architectures and 
also, in many algorithms. The shift function shifts data by the amount specified for each 
dim ension and shifts zeros (null elements) in at the edges of the array. Elements shifted out 
of the array are discarded. The rotate function is similar to the shift function except that 
data shifted out of the array is inserted at the opposite edge so that no data is lost. The first 
argument to the shift and rotate functions is the array to be shifted; then there is an ordered 
set of parameters, each one specifies the amount of shift in its corresponding dimension. There 
must be as many shif t parameters as there are dimensions in the array; the first shift parame- 
ter is associated with the first dimension of the array. 

For e xam ple, given the definitions 
var 

a, b: array [l_5,0_9] of integer; 
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c, <± array [0-9] of integer; 

the statement 

a > shiftCb, 0, 3); 

is functionally equivalent to 

for i .*= 1 to 5 do 
begin 

for j .*= 0 to 6 do 
a&j] > b(i,j+3i 
for j > 7 to 9 do 
afi.j] > 0; 

end; 

and the statement 

c > rotateCd, 3); 

is functionally equivalent to 

for i > 0 to 9 do 
cfi] :=> d[G + 3) mod 10l 


TRANSPOSE AND EXPAND 

While transpose is not a simple function to implement with many parallel architectures, 
a significant number of matrix algori thms involve this function; therefore, it has been made 
available as a primitive function in Parallel Pascal. The first parameter to transpose is the 
array to be transposed and the following two parameters, which are constant expressions, 
specify which dimensions are to be interchanged. If only one dimension is specified then the 
array is flipped about that dimension. 

The main data distribution function in Parallel Pascal is expand. This function increases 
the rank of an array by one by repeating the contents of the array along a new dimension. 
The first parameter of expand specifies the array to be expanded, the second parameter, a con- 
stant expression, specifies the number of the new dimension and the last parameter, a subrange 
or a subrange type, specifies the range of the new dimension. 

This function is used to maintain a higher degree of parallelism in a parallel statement; 
this may result in a clearer expression of the operation and a more direct parallel implementa- 
tion. In a conventional serial environment such a function would simply waste space. 

For example, given the definitions of a, b, and c as specified in section 5.1 the following 
statement adds a vector to all rows of a matrix 

a :=» b + expandCc, 1, 1-5); 

The above statement is functionally equivalent to the following 

for i := 1 to 5 do 
afU > b(ij + q 
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CONDmONAL EXECUTION 

An important feature of any parallel programming language is the ability to have an 
operation operate on a subset of the elements of an array. In standard Pascal each array ele- 
ment is processed by a specific sequence of statements and there are a variety of program con- 
trol structures for the repeated or selective execution of statments. In Parallel Pascal the 
whole array is processed by a single statement; therefore, an extended program control struc- 
ture is needed. 

The syntax of the Parallel Pascal where statement is as follows: 

where array-expression do 
statement 
otherwise 
statement 

where array-expression is a Boolean valued array expression and statement is a Parallel Pascal 
statement. The otherwise and the second controlled statement may be omitted. 

The execution of a where structure is defined as follows. First, the controlling expres- 
sion is evaluated to obtain a Boolean array (mask, array). Next, the first controlled statement is 
evaluated. Array assignm ents are masked according to the boolean control array. If there is 
an otherwise statement it is then evaluated; in this case array assignments are masked with 
the inverse of the control array. 

For example, given the definition 

var a, b, carray [1-10] of integer; 

the following expression 

where a < b do 
c >■ b 
otherwise 
o a; 


is functionally equivalent to 

for i > 1 to 10 do 
if a[i] < b(i] then 

c£i] » bCi] 
else 

c£i] > a£ii 

The main semantic difference between the where-do-otherwise structure and the if- 
then-else structure is that with the former both controlled statements are evaluated, 
independent of the value of the control expression, while with the latter only one of the two 
controlled statements is evaluated. 

Where statements may be nested provided that all of the controlling array expressions 
are type compatible. Other standard Pascal control statements can also be nested within 
where statements. Any array variable which appears on the left hand side of an assignment 
within a where controlled statement must be type compatible with the controlling array 
expression. Assignments to other than array variables in a where statement are in no way 
affected by the where statement. The effect of a where statement is local to the procedure or 
function in which it occurs; that is, it does not affect the execution of any procedures or func- 
tions called from wi thin a where statement or an otherwise statement. 
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Parallel Pascal 


PPASCAL(l) 


BIT-PLANE INDEXING 

A feature of several current highly parallel computers such as the MPP is that arith- 
metic is conducted at the bit level rather than the word or number level. That is, the com- 
puter " word" or bit plane mani pulate by these computers is a single bit slice through all ele- 
ments in the array being processed. 

Some algorithms can be made considerably more efficient for these computers if specified 
at the bit plane leveL Bit-plane indexing was added to Parallel Pascal to enable a programmer 
to conveniently specify most of these special algorithms without resorting to an assembly code 
subroutine. 

A bit-plane index is specified by the last item in an index expression and is separated 
from other indices by a colon. The result of a bit-plane indexed array has a Boolean element 
type. For example, given the definition 

var a: array [1-5,1-10] of integer; 
var b: array [1-5,1-10] of Boolean; 

then the statement 

b > a[:0l 

is equivalent to 

b >» odd(a); 

The next example subtracts one from the selected array element if necessary to make it 
exactly divisible by 2. 

a[3,ld)] > false; 

The least significant or first bit-plane is always bit-plane 0. Programming with bit-plane 
indexing requires a knowledge of the internal number representation of the parallel processor 
and is a highly non portable feature. Furthermore, bit-plane indexing on a processor which 
does not operate at the bit level is usually very inefficient. 
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NAME 

allm — masked all reduction functions 
SYNOPSIS 

{ ^library reduce.pl } 

function aliml( arg, maVir; vec ); boolean; extern; 
function allm2( arg, mask: mat ): boolean; extern; 
function allm3( arg, mask: arr3 ): boolean; extern; 
function allm4( arg, masV; arr4 )i boolean; extern; 
function allm5( arg, mask: arr5 ): boolean; extern; 

TYPES 

vec: a vector of type boolean 

mat: a matrix of type boolean 

arr3: a three dimensional array of type boolean 

arr4: a four dimensional array of type boolean 

arr5: a five dimensional array of type boolean 

DESCRIPTION 

These functions reduce the first argument where there are true elements in the boolean mask 
second argument. All dimensions are automatically reduced and the final result is always a 
scalar value. The range of the dim ensions of arg and mask must match. The following func- 
tions are available: 

allml all reduction for vector arguments 
allm2 all reduction for matrix arguments 
allm3 all reduction for three dimensional arrays 
allm4 all reduction for four dimensional arrays 
allm5 all reduction for five dim ensional arrays 
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NAME 

anym — masked any reduction functions 
SYNOPSIS 

{ Slibrary reduce.pl } 

function anyml( arg, mask: vec ): boolean; extern; 
function anym2( arg, mask: mat ): boolean; extern; 
function anym3( arg, mask: arr3 ): boolean; extern; 
function anym4( arg, mask: arr4 )i boolean; extern; 
function anym5( arg, mask: arr5 ): boolean; extern; 

TYPES 

vec a vector of type boolean 

mat: a matrix of type boolean 

arrl a three dimensional array of type boolean 

arr4: a four dimensional array of type boolean 

arr5: a five dimensional array of type boolean 

DESCRIPTION 

These functions reduce the first argument where there are true elements in the boolean mask 
second argument. All dimensions are automatically reduced and the final result is always a 
scalar value. The range of the dimensions of arg and mask must match- The following func- 
tions are available: 

anyml any reduction for vector arguments 
anym2 any reduction for matrix arguments 
anym3 any reduction for three dimensional arrays 
anym4 any reduction for four dimensional arrays 
anym5 any reduction for five dim ensional arrays 
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NAME 

blint — Bilinear interpolation procedure for a matrix 

SYNOPSIS 

{Slibrary blint.pl} 

procedure bliat( ro,co: real; rp,cp: pli; rf,cf: plr; msk: plb; b,d: plr; var result: plr); 
extern mxrowl, mxrowh, mxcoll, mxcolh; 

TYPES 

plr = array [mYrn^ l-m-rm -w- h.rny cn ll-rnT Colhl of btype; 
pli =■ array [mYT rmr l.mYr nwh ; mYcn11miYcnlh1 of itype; 
plb => array [mYT nw l-mYT nw h,mY cn ll-mY cn1h] of boolean; 

Where btype is any type and itype is an integer or subrange base type 

EXTERN CONSTANTS 

mxrowl - the smallest row number of the input matrix 
mxrowh - the largest row number of the input matrix 
mxcoll * the sma ll est column number of the input matrix 
mxcolh “ the largest column number of the input matrix 

DESCRIPTION 

Blint is a procedure that can be used in conjunction with irotate (see rotation(2)). It performs 
a local search to find three out of four vertices of a square that contains the point whose value 
we want to interpolate. The fourth point is the top left comer of the square and is immedi- 
ately defined from the matrices rp and cp (row and col umn coordinates) and matrix b (value 
of the point). Matrix d is a horizontally shifted version of matrix b. Matrices rf and cf contain 
the coefficients used to perform the interpolation. The false values in matrix msk indicate the 
positions in the rotated matrix that will be filled with zeros. The coordinates ro and co are the 
center of the rotation computed in irotate. 

The bilinear interpolation is performed in the following way 
ta 4.1c cf 

pi * f * P2 
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P - (1 - cf>(l - rf>Pl + (1 -rf>cf*P2 + (l - cf>rf*P3 + cf*rf*P4 
for all points P in the matrix. 

AUTHOR 

Cristina Moura 

SEE ALSO 

rotation(2),lblint(2),cint(2) 
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NAME 


ceiling — round up to integer value 
SYNOPSIS 

{$library math.pl } 

function ceiling( xmtype ): rtype; extern; 


TYPES 

atype: an arbitrary shaped array of element type real 

rtype: an array with similar dimensions to atype of element type integer or subrange 


DESCRIPTION 

Ceiling converts a real array to an integer array rounding each element to the smallest integer 
not less than the element. 
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NAME 

cint — Cubic interpolation procedure for a matrix 

SYNOPSIS 

{^library cint.pl} 

procedure cint( ro,co: real; rp,cp: pli; rf,cf: plr; mskr plb; b,d: plr; var result: plr); 
extern mxrowl, mxrowh, mxcoll, mxcolh; 

TYPES 

plr = array [mTmw1_mYTnwh,mTcn11-rnYcn1h] of btype; 
pli - array [mTrow l-mTT Tvwh.mTenll-mTcnlh] of itype; 
plb - array [mxT nw 1-mYT nw h r mT cnll-mTer>lh] of boolean; 

Where btype is any type and itype is an integer or subrange base type 

EXTERN CONSTANTS 

mxrowl » the smallest row number of the input matrix, 
mxrowh - the largest row number of the input matrix 
mxcoll - the smallest column number of the input matrix, 
mxcolh - the largest column number of the input matrix. 

DESCRIPTION 

Cint is a procedure that can be used in conjunction with irotate (see rotation(2)). It performs a 
local search to find fifteen out of sixteen points located around the point whose value we want 
to interpolate. The sixteenth point is imme diately defined from the matrices rp and cp (row 
and column coordinates) and matrix b (value of the point). Matrix d is a horizontally shifted 
version of matrix b. Matrices rf and cf contain the coefficients used to perform the interpola- 
tion. The false values in matrix msk. indicate the positions that will be filled with zeros in the 
rotated matrix. The coordinates ro and co are those of the center of the rotation computed in 
irotate. 

The cubic interpolation for sixteen points is done by, first, performing a cubicinterpolation for 
each one of the four rows and, then, based on the points obtained, performing a fifth cubic 
interpolation. 
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We already have the value and position of point P 6 through matrices rp,cp and b. The first 
four cubic interpolations are performed to obtain points pa, pb, pc and pd. The fifth one yields 
the value of point P. 

AUTHOR 

Cristina Moura 
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NAME 

compn — near neigbcr comparison function 

SYNOPSIS 

{Slibrary mx.pl } 

compn (mrmtype; wrwtype ):mtype; extern size; 

TYPES 

mtype a parallel array [lol Jail, lo2-hi2] of boolean; 
wtype array [0-size, 0-size] of 0-2; 

DESCRIPTION 

Compn compares the local neighborhood of each element of the boolean input matrix m with 
the window w. If a match occures then the result element is true, otherwise it is false. A zero 
in w matches with false in a, a one in w matches with true in m and a two in w is a don’t 
care. 

AUTHOR 

A. P. Reeves 
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NAME 

conv,convg — matrix convolution functions 
SYNOPSIS 

{$libraxy convolve } 

function conv(matrixnntype; kemekktype )rrtype; extern size; 

function convgCmatrixnngtype; kemeltkgtype )trgtype; extern kll, kh.1, kl2, kh2, 
mshift; 

TYPES 

rtype =» parallel array [lol-hi2, lol-hi2] of rbtype; 
mtype =» parallel array [lol Jiil, lo2-hi2] of mbtype; 
ktype =» array [O-size-l, 0-size-l] of kbtype; 

mgtype =» parallel arxay{ idxtype ] of mbtype; 
rgtype - parallel array[ idxtype ] of rbtype; 
kgtype =» array{kll_kl2, kl2_kh2] of kbtype; 

where rtype, mbtype and kbtype are base types; rbtype must be conformable with the pro- 
duct of a kbtype with a mbtype. 

EXTERN CONSTANT 

size is a constant specifying the size of the kernel. 

DESCRIPTION 

Conv is a convolution function which convolves a small matrix with a large matrix. The 
small matrix is typically stored in the host computer. The result matrix has the same dimen- 
sions as the large matrix. For the highest large matrix index values the small matrix will 
overlap the large matrix edge. Zero values are used for large matrix elements beyond this 
edge. If it is desired that the result matrix is centered on the input matrix; Le. a result ele- 
ment is computed from the same spatially located element in the input matrix and its near 
neighbors, then this may be achieved by preshifting the input matrix or postshifting the 
result matrix. 

Convg is an extended version of conv. For this function the kernel index range is not con- 
strained to start at zero and is explicitly specified by the extern parameters. Furthermore, the 
shift function used by convg is also specified by an extern parameter. For example, lshift or 
crshift could be specified which would indicate that the function is to operate on large 
(blocked or crinkled) matrices. If any function other than standard shift function is to be 
used then it must be formally declared before convg is declared. 

AUTHOR 

Gary Ross and A. P. Reeves 

SEE ALSO 

lshift(2), crshift(2) 
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NAME 

crshift,crrotate — shift a large crinkled array on a parallel computer 
SYNOPSIS 

{Slibrary lsMft.pl } 

function crsMft( a:1 array; r f c: integer ):larray; 
extern n, m, prow, pcol, la bool; 

function crrotate{ affarray; r,c: integer );larray; 
extern n, m, prow, pcol, la bool; 

TYPES 

larray = parallel array [ 0_n, 0_m, 0-prow, 0-pcol ] of btype; 
labool =* parallel array [ 0-n, 0-m, 0-prow, 0-pcol ] of boolean; 

where btype is any base type. 

DESCRIPTION 

Crshift is a large matrix shift function for arrays stored with the cr inkl ed format for paral- 
lel processors, such as the MPP, which have a fixed range of parallelism for the two parallel 
dimensions. Crrotate is similar to crshift except that a rotate operation rather than a shift 
operation is performed. The second and third arguments to crshift specify the amount the 
matrix is to be shifted in each dim ension in a similar way to the standard shift function. 

The large matrix is stored in a four dim ensional array structure; the last two dim ensions of 
this array specify the block size which can be directly processed in parallel by the hardware. 
The range of the first dim ension of this array specifies the number of blocks in each col umn of 
the large matrix and the second dimension specifies the number of blocks in each row. There- 
fore, the dimensions of the large matrix are C(n + l) * (prow + l)) by (Cm + l) * (pcol + l)). 

With the crinkled storage scheme, adjacent elements of the large matrix are stored in different 
blocks; single element shifting is slightly more efficient with this format when compared to 
storing adjacent elements in blocks. The crinkled storage scheme is illustrated with the fol- 
lowing example. Consider that we have a large matrix ,mx, which is conceptually specified as 
follows; 


ms array{0-x,0_y] of btype; 
and which is stored in the array 
a: larray; 

the mapping of element i,j of the large matrix into the array a is specified by 
mxfoj] - a[i mod cdl, j mod cd2, i div cdl, j div cd2] 
where 

cdl - a + 1 
cd2 = m + 1 
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NAME 

crshiftg,crrotateg — shif t a large crinkled array on a parallel computer 
SYNOPSIS 

{Slibrary gshift.pl } 


function crshiftg{ a:! array; r,c: integer )darray; 

extern nl, ah, ml, mh, prowl, prowh, pcoll, pcolh, 1a bool; 


function crrotateg( a:larray; r,cs integer ):larray; 

extern nl, nh, ml, mh, prowl, prowh, pcoll, pcolh, labool; 


TYPES 

larray = parallel array [ nLnh, rn1_mh, prowLprowh, pcoILpcolh ] of btype; 
labool » parallel array [ nLnh, prowLprowh, pcoILpcolh ] of boolean; 


where btype is any base type. 

DESCRIPTION 

Crshiftg and Crrotateg are extended versions of the functions Cr shift and Crrotate. With 
these extended functions it is possible to have values other than zero for the first index value 
of array index ranges. 

Crshiftg is a large matrix shift function for parallel processors, such as the MPP, which have 
a fixed range of parallelism for the two parallel dimensions when the cr inkl ed format is used 
for large matrix storage. Crrotateg is similar to crshiftg except that a rotate operation rather 
than a shift operation is performed. The second and third arguments to crshiftg specify the 
amount the matrix is to be shifted in each dimension in a similar way to the standard shift 
function. 

The large matrix is stored in a four dimensional array structure; the last two dimensions of 
this array specify the block, size which can be directly processed in. parallel by the hardware. 
The range of the first dimension of this array specifies the number of blocks in each column of 
the large matrix and the second dim ension specifies the number of blocks in each row. There- 
fore, the dimensions of the large matrix are ((nh-nl+l) * (prowh-prowl+l)) by ((mh-ml+l) * 
(pcolh-pcoll+l)). 

With the crinkled storage scheme blocks do not contain adjacent large matrix elements. This 
method is slightly faster for single element shifts than storing adjacent elements in blocks. 
The c rinkl ed storage format is illustrated with the following example. Consider that we have 
a large matrix ,mx, which is conceptually specified as follows: 

mr array{0-x,0-y] of btype; 

and which is stored in the array 

a: larray; 

the mapping of element i,j of the large matrix into the array a is specified by 
mx{i,j] = a£i mod cdl, j mod cd2, i div cdl, j div cd2] 
where 

cdl => nh - nl + 1 
cd2 = mh - ml + 1 
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NAME 

iconv, rconv, bconv — matrix convolution functions 


SYNOPSIS 

{$library convn.pl } 


function iconv4(matrixritype; kdternel Mtype; extern; 
function rconv2(matrix a r t y p e; ktkernel )triype; extern; 
function bconv5(matrix:btype; krkemel ):btype; extern; 

TYPES 

itype — parallel array [0_x, 0-y] of integer; 
rtype = parallel array [0-x, 0_y] of real; 
btype =» parallel array [0_x, 0-y] of boolean; 

kernel => parallel array [O-size- 1, O-size- l] of (real, integer or boolean); 

NOTES 

1) The type of data specified by kernel must be compatible 'with, the type specified for itype. 

2) The function name specifies the size of the kernel and the type of matrices involved (see 
description). 

DESCRIPTION 

The functions iconv, rconv, and bconv are all convolution functions. Selection of the proper 
function is determined by the data type involved in the convolution. Iconv is used if integer 
values are used, rconv is used with real valued matrices and bconv is used with boolean 
matrices. In addition to specifying the type of matrices involved, the size of the convolution 
kernel must be included in the function name. Fucntions exist to do convolution using kernels 
of size 2x2 to size 5x5. For example to do an integer valued convolution with a kernel of size 
4x4 the function iconv4 would be used. 

AUTHOR 

Gary Ross conv(2) 
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NAME 

irotatejirotate — Rotation matrix generators 
SYNOPSIS 

{{library rotation.pl} 

procedure irotate (var rpqjli; var cprpli; var mskrplb; var rfrplr; Tar cfrplr; 
ro,co,tbeta:btype; idl, id2:pli); extern mxrowl, mxrowh, mxcoll, mxcolh; 

procedure nrotate (var rprpli; Tar cpqpli; Tar mskrplb; ro,co,theta:btype; idl, id2rpli); 
extern mxrowl, mxrowh, mxcoll, mxcolh; 

TYPES 

plr *» array [mxrow l_mxr o wh^nixc o ILmxco lh] of btype; 
pli - array [rnTrow l-myr nw ^mY cnll-mvcrilh] of itype; 
plb - array [mxrowLjnxrowlvnxcoll-mxcoIh] of boolean; 

Where btype is any type and itype is an integer or subrange base type 

EXTERN CONSTANTS 

mxrowl - the smallest row number of the input matrix 
mxrowh - the largest row number of the input matrix 
mxcoll - the smallest column number of the input matrix 
mxcolh =* the largest col umn number of the input matrix 

DESCRIPTION 

Irotate generates permutation matrices for realizing a given rotation transformation. Irotate is 
a version of the rotation procedure to be used in conjunction with the interpolation procedures 
blint or cint (see blint(2) and cint(2)). The rotation is specified by the coordinates of its 
center, ro and co (row and column), and by its angle theta. It returns two matrices containing 
the truncated row and col umn coordinates for rotation of the original matrix, respectively rp 
and cp, and a mask msk whose false elements indicate the positions in the rotated matrix that 
will be filled with zeros. It also returns two matrices containing the differences between the 
original rotated point and its truncated value, both for the row and column coordinates, 
respectively rf and cf. Idl and id2 are two index identifying matrices as created by twodid 
(see mat(2)) 

Nrotate is a version of the rotation procedure to be used in near neighbor operations. It 
returns the same arguments as irotate except for rf and cf. 

AUTHOR 

A. P. Reeves 

SEE ALSO 

mat(2), blint(2), cint(2) 
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NAME 

lblint — Bilinear interpolation procedure for a large matrix 

SYNOPSIS 

{$library lblint.pl} 

procedure lblint( ro,co: btype; rp,cp: LINT; rf,cf: LAURA Y; msk; LBOOL; b,d: LINT; 
var lresult: LARRAY); extern nl,n,m l > m,NL»NH > MI^MH > naT; 

TYPES 

LARRAY - array [NLJVJH^ILMtLPROWL-PROWILPCOLL-PCOLH] of btype; 

LINT - array [NLuNHAlLAIH^ROWL-PROWILPCOLL-PCOLH] of itype; 

LBOOL - array [NL^NH^L^MHJ>ROWIJROWH4<:OLL-PCOLH] of boolean; 
nar - array [NL-NHADLAIH] of boolean; 

Where btype is any type and itype is an integer or subrange base type 

DESCRIPTION 

Lblint is a modified version of blint that can be used in conjunction with lirotate (see Rota- 
tion^)). It performs a local search to find three out of four vertices of a square that contains 
the point whose value we want to interpolate. The fourth point is the top left comer of the 
square and is immediately defined from the matrices rp and cp (row and column coordinates) 
and matrix b (value of the point). Matrix d is a horizontally shifted version of matrix b. 
Matrices rf and cf contain the coefficients used to perform the interpolation. 

The bilinear interpolation is performed in the following way: 
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P - (1 - cf>(l - rf>Pl + (1 -rf>cf*P2 + (1 - cf>rf*P3 + cf*rf*P4 
for all points P in the matrix. 

The large matrices are stored in a four dimensional array. The last two dimensions of this 
array specify the block, size which can be directly processed in parallel by the hardware. The 
range of the first dimension of this array specifies the number of blocks in each row of the 
large matrix and the second dimension specifies the number of blocks in each column. There- 
fore, the dim ensions of the large matrix are 

((NH - NL + l)* (PROWL - PROWH + 1)) by ((MH - ML + 1>(PC0LH - PCOLL + l)). 

The arguments nl, n, ml and m specify also the dimensions of the large matrix. That is(n- 
nl + l) by Cm - ml + l). 

The function Irotateg (Slibrary large.pl) has to be declared before lblint can be used. 

AUTHOR 

Cristina Moura 

SEE ALSO 

lrotation(2),blint(2),cint(2) 
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NAME 

lcint — Cubic interpolation procedure for a large matrix 

SYNOPSIS 

{Slibrary lciat.pl} 

procedure lciat( ro,co: btype; rp,cp: LINT; rf,cf: LAURA Y; msk: LBOOL; b,d: LINT; 
Tar lresult; LARRAY); extern mxrowl, mxrowh, mxcoll, mxcolh, NL, NH, ML, MH, 
riar; 


TYPES 

LARRAY = array [NL-NH^lL^MH^ROWL^ROWHJKrOLL-PCOLH] of btype; 

LINT - array [NL-MLML^MHJ’ROWL^ROWH^OLL-PCOLH] of itype; 

LINT - array [NL^N1IA^\1HJ 3 R0WI^R0WHJX:0LUJK:0U!] of boolean; 
nar - array [NL-NILML-MH] of boolean; 

Where btype is any type and itype is an integer or subrange base type 

DESCRIPTION 

Ldrtf is a modified version of cint (see cint(2)) that can be used in conjunction with lirotate 
(see lrotatian(2)). It performs a local search to find fifteen out of sixteen points located around 
the point whose value we want to interpolate. The sixteenth point is immediately defined 
from the matrices rp and cp (row and column coordinates) and matrix b (value of the point). 
Matrix d is a horizontally shifted version of matrix b. Matrices rf and cf contain the 
coefficients used to perform the interpolation. 

The cubic interpolation for sixteen points is done by, first, perfo rming a cubicinterpolation for 
each one of the four rows and, then, based on the points obtained, perfor min g a fifth cubic 
interpolation. 
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We already have the value and position of point P 6 through matrices rp,cp and b. The first 
four cubic interpolations are performed to obtain points pa, pb, pc and pd. The fifth one yields 
the value of point P. 

The large matrices are stored in a four dimensional array. The last two dim ensions of this 
array specify the block size which can be directly processed in parallel by the hardware. The 
range of the first dimension of this array specifies the number of blocks in each row of the 
large matrix and the second dim ension specifies the number of blocks in each column. There- 
fore, the dim ensions of the large matrix are 

((NH - NL + l)* (PROWL - PROWH + 1)) by ((MH - ML + 1>(PC0LH - PCOLL + D). 

The arguments mxrowlmxrowh, mxcoll and mxcolh specify also the dimensions of the large 
matrix. That is (mxrowh - mxrowl + l) by (mxcolh - mxcoll + 1). 
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The function lrotateg (Slibrary large.pl) has to be declared before lcint is used. 

AUTHOR 

Cristina Moura 

SEE ALSO 

lrotation(2),blint(2)4blint(2) > cint(2) 
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NAME 

lirotatednrotate — Rotation matrix generators for large matrices 


SYNOPSIS 

{Slibrary lrotation.pl} 


procedure lirotate (var rpiUNT; var cptLINTj var mslcLBOOL; Tar rfiLARRAY; 
var cfiLARRAY; ro, co, thetaibtype; idl, id2:LINT); 
extern NL, NH, ML, MH, PROWL, PROWH, PCOLL, PCOLH; 


procedure Inrotate (Tar rpiLINT; Tar cptLINT; Tar mskiLBOOL; ro, co, thetaibtype; 
idl, id2dJNT); 

extern NL, NH, ML, MH, PROWL, PROWH, PCOLL, PCOLH; 


TYPES 

LARRAY - array [NL-NH^O^MHJROWI^ROWHJ^OIX-PCOLH] of btype; 

LINT - array [NLJSIHA^MHJROWL^ROWHJ^OLL-PCOLH] of itype; 

LBOOL - array [NL-NHA1LA1HJ 3 R0WLU 5 R0WHJ*:0LLAK:0LH] of boolean; 

Where btype is any type and itype is an integer or subrange base type 

DESCRIPTION 

Lirotate is a large matrix version of irotate for interpolation (see rotation(2)). 

It returns two matrices, rp and cp, containing the truncated, rf and cf, row and column coor- 
dinates for rotation of the original matrix, as well as two matrices, rf and cf, containing the 
differences between the original rotated points and their truncated values. 

Lnrotate is a large matrix version of nrotate for near neighbor (see rotation(2)). 

It returns two matrices, rp and cp, containing the row and column coordinates for rotation of 
the original matrix, rounded to its near neighbor coordinates. 

For both lintrotat and lnearotat the large matrix is stored in a four dimensional array 

structure; the last two dimensions of this array specify the block size which can be directly 
processed in parallel by the hardware. The range of the first dimension of this array specifies 
the number of blocks in each row of the large matrix and the second dimension specifies the 
number of blocks in each column. Therefore, the dimensions of the large matrix are 
((NH - NL + 1) * (PROWH - PROWL + l)) by ((MH - ML + l) * (PCOLH - PCOLH + l)). 

SEE ALSO 

mat(2)jotation(2) 
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NAME 

lperm2 - permute data in a large array on a parallel computer 
SYNOPSIS 

{$library perm2.pl } 


function lperm2( rn-g: larray; r,c: lint; msk: lbool ); larray; 

extern nl, nh, ml, mh, prowl, prowh, pcoll, pcolh, btype; 


TYPES 

larray =* parallel array [ n l.nh, ml.mh, prowLprowh, pcolLpcolh ] of btype; 
lint =* parallel array [ nLnh, ml-mh, prowLprowh, pcolLpcolh ] of integer; 
lbool =• parallel array [ nl-n'n, ml.mh, prowLprowh, pcolLpcolh ] of boolean; 
pi =■ parallel array [ prowLprowh, pcolLpcolh ] of integer; 


where btype is any base type. 

VARS 

idl,id2;pi; 

Idl and id2 are globally declared index identifying variables as created by twodid (see 
mat(2)). These should be initialized before using lperm2. 

DESCRIPTION 

Lperm2 is a large matrix permutation function for parallel processors, such as the MPP, which 
has a fixed range of parallelism for the two parallel dim ensions. It is s imil ar in concept to 
perm2(2) but operates on large matrices stored in the blocked format. The second and third 
arguments to lperm2 specify the conceptual large matrix index coordinates for the data to be 
permuted. The fouth argument specifies the permutation conditions - true where permutation 
is possible and false where permutation is out of range. In the false condition, zero will be put 
in place of data in the resulting matrix. 

The large matrix is, in fact, stored in a four dimensional array structure; the last two dimen- 
sions of this array specify the block size which can be directly processed in parallel by the 
hardware. The range of the first dim ension of this array specifies the number of blocks in 
each column of the large matrix and the second dimension specifies the number of blocks in 
each row. Therefore, the dim ensions of the large matrix are ((nh-nl+l) * (prowh-prowl+l)) 
by (Cmh-ml+l) * (pcolh-pcoll+l)). 

SEE ALSO 

mat(2), perm2(2) 
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NAME 

lshiftdrotate — shif t a large array on a parallel computer 
SYNOPSIS 

{Sllbrary lshift.pl } 

function lshift( adarray; r,c: integer Marray; 
extern n, m, prow, pool, labool; 

function lrotate( atlaxray; r,c: integer ):larray; 
extern n, m, prow, pool, labool; 

TYPES 

larray =■ parallel array [ 0-n, 0-m, 0-prow, 0-pcol ] of btype; 
labool = parallel array [ 0-n, 0-m, O-prow, 0_pcol ] of boolean; 

where btype is any base type. 

DESCRIPTION 

Lshift is a large matrix shift function for parallel processors, such as the MPP, which have a 
fixed range of parallelism for the two parallel dim ensions. Lrotaie is simil ar to lshift except 
that a rotate operation rather than a shift operation is performed. The second and third argu- 
ments to lshift specify the amount the matrix is to be shifted in each dimension in a similar 
way to the standard shift function. 

The large matrix is, in fact, stored in a four dim ensional array structure; the last two dimen- 
sions of this array specify the block size which can be directly processed in parallel by the 
hardware. The range of the first dimension of this array specifies the number of blocks in 
each column of the large matrix and the second dimension specifies the number of blocks in 
each row. Therefore, the dimensions of the large matrix are (Cn + l) * (prow + l)) by (Cm + 
l) * (pcol + 1)). 

Each block of the array structure cont ains consecutive elements from the large matrix. For 
example, consider that we have a large matrix ,mx, which is conceptually specified as follows: 

mn array[0_x,0_y] of btype; 

and which is stored in the array 

a: larray; 

the mapping of element i,j of the large matrix into the array a is specified by 
mx (i, j] = afi div cdl, j div cd2, i mod cdl, j mod cd2] 
where 

cdl = prow + 1 
cd2 = pcol + 1 

SEE ALSO 

lshiftg(2) 


A. P. Reeves 


PPL 


1 



LSHIFTG ( 2 ) 


PPS-PDS Users Manual 


LSHIFTG ( 2 ) 


NAME 

lshiftg^rotateg — sliift a large array on a parallel computer 
SYNOPSIS 

{$library gshift.pl } 

function. lshiftg( atlarray; r,c: integer )darray; 

extern nl, nh, ml, mb, prowl, prowh, pcoll, pcolh, labool; 


function lrotateg( a:larray; r^s integer ):larray; 

extern nl, -nh, ml, mb, prowl, prowh, pcoll, pcolh, labool; 


TYPES 

larray = parallel array [ n1_nh, m l.mh, prowLprowh, pcolLpcolh ] of btype; 
labool =• parallel array [ nLn h, mLmh, prowLprowh, pcolLpcolh ] of boolean; 


where btype is any base type. 

DESCRIPTION 

Lshiftg and Irotateg are extended versions of the Ishift and l rotate functions. For these 
extended functions the initial index values of the array ranges are no longer constrained to be 
zero. 

Lshiftg is a large matrix shift function for parallel processors, such as the MPP, which have a 
fixed range of parallelism for the two parallel dimensions. Lrotateg is similar to lshiftg except 
that a rotate operation rather than a shift operation is performed. The second and third argu- 
ments to lshiftg specify the amount the matrix is to be shifted in each dimension in a similar 
way to the standard shift function. 

The large matrix is, in fact, stored in a four dimensional array structure; the last two dimen- 
sions of this array specify the block size which can be directly processed in parallel by the 
hardware. The range of the first dim ension of this array specifies the number of blocks in 
each column of the large matrix and the second dimension specifies the number of blocks in 
each row. Therefore, the dimensions of the large matrix are ((nh-nl+l) * (prowh-prowl+l)) 
by ((mh-ml+l) * (pcolh-pcoll+l)). 

Each block of the array structure contains consecutive elements from the large matrix. For 
example, consider that we have a large matrix ,mx, which is conceptually specified as follows: 

mx: array[0_x,0-y] of btype; 


and which is stored in the array 
a: larray; 

the mapping of element i,j of the large matrix into the array a is specified by 
mx{i,j] - a[i div cdl, j div cd2, i mod cdl, j mod cd2] 
where 

cdl =■ prowh - prowl + 1 
cd2 = pcolh - pcoll + 1 

SEE ALSO 

lshift(2) 
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NAME 

matran<Lrandinit — matrix random number generator 
SYNOPSIS 

{Slibrary matrand } 
function matrand; extern 

dinyrymm l,mrand,mbooI,pahrang,coef,ncoef .state, window; 

procedure randinit; extern 

dim,nym.m l > mrand > mbool t pai > rang,coef ncoef .state, window; 

TYPES and VARS 
const 

dim = 128; (* size of matrix or mpp dimensions *) 
m - 40; (* number of bits in the shift register *) 

mml =* 39; (* m - 1 *) 

mxand = 14; (* number of bits in generated random number *) 
type 

index - 1- dim; 

mbool - parallel arrayfindexyndex] of boolean; 
pai =* parallel array[ind exinde x] of integer; 
rang => 0-mml; 
var 

coefnrray[l_m] of rang; (* list of feedback, bits «) 
ncoef±nteger^» number of feedback bits *) 
state: parallel axray{r angyndexynde x] of boolean; 

(* array of shift registers *) 

windowrrang; (« index of last random bit of shift register *) 

DESCRIPTION 

Matrand is a function that returns an array of pseudo-random integers. Randinit initializes 
the random number register used by matrand. This register contains a random seed for every 
element of the random matrix. Initial seed random numbers are generated serially by ran- 
dinit. 

All constants, types and variables used by matrand are passed in the extern statement so that 
multiple random number generators with different parameters may be used in the same pro- 
gram. 

AUTHORS 

B. Finnerty, AJ’JR.eeves 

BUGS 

Randinit calles srand for a single positive integer random number. Srand uses overflow 
integer arithmetic. It could be replaced with a conventional Pascal random number generator. 
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NAME 

maxm — masked max reduction functions 
SYNOPSIS 

{ Slibrary reduce.pl } 

function maxml( arg : veer, mask: vec ): btype; extern; 
function maxm2( arg : matr, mask: mat ): btype; extern; 
function maxm3( arg : arr3r, mask: arr3 
function maxm4( arg : arr4r, mask: arr4 ): 
function maxm5( arg : arr5r, mask: arr5 ): 

TYPES 

vec a vector of type boolean 
mat: a matrix of type boolean 
air 3: a three dimensional array of type boolean 
arr4: a four dimensional array of type boolean 
arr5: a five dimensional array of type boolean 
veer: a vector of btype 
matn a matrix of btype 
arr3n a three dimensional array of btype 
arr4n a four dimensional array of btype 
arr5n a five dimensional array of btype 
where btype is a numeric base type 

DESCRIPTION 

These functions reduce the first argument where there are true elements in the boolean mask 
second argument. All dim ensions are automatically reduced and the final result is always a 
scalar value. The range of the dimensions of arg and mask must match. The following func- 
tions are available: 

maxml 

max reduction for vector arguments 

maxm2 

max reduction for matrix arguments 

maxm3 

max reduction for three dimensional arrays 

maxm4 

max reduction for four dimensional arrays 

maxm5 

max reduction for five dimensional arrays 


btype; extern; 
btype; extern; 
btype; extern; 
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NAME 

minm — masked min reduction functions 
SYNOPSIS 

{ Slibrary reduce.pl } 

function minml( arg : veer, mask: vec ): btype; extern; 
function minm2( arg : matr, mask: mat ): btype; extern; 

function minm3( arg : arr3r, mask: arr3 ): btype; extern; 

function minm4( arg : arr4r, mask: arr4 ): btype; extern; 

function minm5( arg : arr5r, mask: arr5 ): btype; extern; 

TYPES 

vec: a vector of type boolean 
mat: a matrix of type boolean 
arr3: a three dimensional array of type boolean 
arr4: a four dimensional array of type boolean 
arr5: a five dimensional array of type boolean 
veer: a vector of btype 
matn a matrix of btype 
arr3n a three dimensional array of btype 
arr4n a four dimensional array of btype 
arr5n a five dimensional array of btype 
where btype is a numeric base type 

DESCRIPTION 

These functions reduce the first argument where there are true elements in the boolean mask 
second argument. All dimensions are automatically reduced and the final result is always a 
scalar value. The range of the dim ensions of arg and mask must match. The following func- 
tions are available: 

minml min reduction for vector arguments 
minm2 min reduction for matrix arguments 
minm3 min reduction for three dim ensional arrays 
minm.4 min reduction for four dim ensional arrays 
minm.5 min reduction for five dimensional arrays 
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NAME 

mperm2 — Modified two dimensional mapping procedure 


SYNOPSIS 

{Slibrary mperm2.pl } 


procedure mperm2(mxrpa?7ar perm2rpa;var perm3:pa; rrpi; crpi; ro,co:btype); extern 
lol, Ml, lo2, M2; 

TYPES 

pa «• array [lol_hil4o2_hi2] of bltype; 
pi - array [lol-Ml,lo2.hi2] of itype; 

Where btype and bltype are any type and itype is an integer or subrange base type. 

VARS 

idl,id2: pi; 

Idl and id2 are two global index identifying matrices as created by twodid (see mat(2)). These 
must be initialized before using mperm2. 

DESCRIPnON 

Mperm2 is a modified version of the two dimensional mapping function perm2 (see perm2(2)) 
which can implement any into mapping and produces also 


perm2(i.j] := mx(i{i,j],cii,ji 
rl[i,j] > r{i,jl 
clfi.il cfi.j] + e; 
perm3(i,jj >• mx[rl[i,jLcl[i,ji 
where e can be 1 or -1 depending on the values of co 
(the column value of the point of rotation) and M2. 

That is, the r and c matrices contain the row and column indices, respectively, of where each 
element in perm2 is to be obtained from in mx and rl and cl the indices for perm3. 

AUTHOR 

Cristina Moura 

SEE ALSO 

mat(2),perm2(2) 
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NAME 

mx — input values into a square matrix 

SYNOPSIS 

{^library mx.pl } 


function mx3( vOO, vOl, v02, 

▼10, Til, vl2, 

▼20, v21, v22 :btype) ntype; extern; 


TYPES 

▼ij - The value to go into location (i,j) 
rtype - parallel array [0-x, 0_x] of btype; 


where btype is any base type; 

DESCRIPTION 

MX inputs the values of elements of a square matrix into the matrix It can be used to input 
values into arrays of size 2x2 to 5x5 . There are four different mx functions, one for each size 
matrix To get the ap pr o p r i ate function use mx# where # is the size of the array (Note: # = 
x+1 from the type definitions above). 


AUTHOR 

Gary Ross 
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NAME 

near and, nearor, andnn, omn — near neighbor logical functions 

SYNOPSIS 

{$library nn.pl } 

function nearandCmatrixrmtype ; k3:ktype): mtype; extern; 
function nearorCmatrixantype; k3 detype): mtype; extern; 

function andnn(matrixrmtyp e; maskidirset): mtype; extern; 
function oimnCmatrixnntype; maskxiirset): mtype; extern; 

TYPES 

mtype *■ parallel array [lol_hil, lo2-hi2] of boolean; 
ktype - parallel array [0-2, 0-2] of integer; 

neighbors = (NW,N,I^W,C£SW,S4>E); 
dirset =* set of neighbors; 

DESCRIPTION 

The near neighbor logical functions logically combine (AND or OR) each matrix element with 
its near neighbors according to the second argument. The set of neighbors to be combined is 
specified in two different ways. 

In near and and nearor the set of neighbors is specified by a 3x3 kernel. Those neighbors that 
are to be selected will have positive (or true) values in the corresponding position in the 
kerenel. 

In andnn and omn the set of neighbors is specified by a direction set. The directions are NW, 
N, NE, W, E, SW, S, SE corresponding to compass directions, as well as C indicating the central 
element. 

AUTHOR 

Gary Ross 
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NAME 

niynnet^recur — BASE assembly language functions 

SYNOPSIS 

{Slibrary base.pl } 

function nn(d:direction, b:bplane):bplane; extern; (* edge false *) 
function nnet(d;dir ection, b;bplane);bplane; extern; (* edge true *) 

procedure recur(var nbplane, bibplane): bplane; extern; 

procedure readbp(var bptbplane); extern; 
procedure writebp(bp: bplane); extern; 

CONST 

diml = (* first dim ension of the bitplane matrix *) 
dim 2 = (* second dim ension of the bitplane matrix *) 

TYPES 

bplane => paxrallel array [l_diml, l_dim2] of boolean; 
directon » set of 0-8; 

VARS 

terminated; boolean; 

DESCRIPTION 

These functions form the primitives for the BASE binary array processor (BAP) assembly 
language. The may also be used for general BAP operations and for implementing other BAP 
assembly languages. 

Nn computes an OR function over the selected near neighbors of a bitplane (boolean matrix). 
The near neighbors are labeled as follows: 

1 2 3 
8 0 4 
7 6 5 

Nnet is similar to nn except that near neighbor values outside the edge of the matrix are con- 
sidered to be true rather than false. 

recur is used to implement recursive instructions. It compares the old and new values for 
each iteration of the instruction and sets terminated to true when they are identical. See the 
example section for the use of this function. 

Readbp reads a bitplane matrix from a file. The data values in the file are either 0 or 1 which 
are converted to false and true respectively. Writebp writes a boolean bitplane to a file using 
the 0 1 format. 

EXAMPLES 

Examples are given below of the syntax structures which are valid for the BASE instruction 
set. Other BAP instruction sets may be emulated. 

const NW = 1; N = 2; NE = 3; E = 4; SE => 5; S =* 6; SW = 7; W = 8; 
begin 

(* Boolean *) 

r := {Boolean function of three variables}; 
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(* Simple near neighbor *) 
begin 

t > nn([{ direction list}], {bplane variable}); 

{the bplane variable may be preceeded by not} 
r > {Boolean expression of two variables and t} 
end 

{In most cases a single statement involving nn may be used} 

(* Recursive near neighbor *) 

repeat recurCr, {simple near neighbor expression}) until ter min ated; 

(* Global feature extraction *) 

bor > any{{bplane variable}, 1, 2); 
isum > sum(ord({ bplane variable}), 1, 2); 

(* Examples *) 
r > a; 

r :=» a and b or c and d; 
r >» a Ob; 

r :=■ nnetdWl a); 

r>a and nn([N^E,Wl a); 

r :=> nndl-8], a); 

repeat recurCr, r or nndN], r)) until terminated; 

repeat recur(r, r and not and 1-8], not r)) until terminated; 

bor :=» any(a, 1, 2); 
end- 

(* bitplane stacks may be defined as an array of bitplanes *) 
var 

bpst: array{0_7] of bplane; 

(* bpst: array(0-7, 1-dim 1, l_dim2] of Boolean *) 

(* bpst{0] is the least significant bitplane *) 

AUTHOR 

A-P-Reeves 
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NAME 

perm2,perm2s — General two dimensional mapping function 


SYNOPSIS 

{Slibrary penn2.pl } 


function perm2(mxrpa; rtpi; cqpilrpa; extern lol. Ml, lo2, M2; 
function perm2s(mxrpa; rrpi; crpi):pa; extern lol. Ml, lo2, M2; 

TYPES 

pa - array [lol-hil,lo2«hi2] of btype; 
pi « array [lolJiil,lo2~hi2] of itype; 

Where btype is any type and. itype is an integer or subrange base type. 

VARS 

idl,id2: pi; 

Idl and id2 are two global index identifying matrices as created by twodid (see mat(2)). These 
must be initialized before using per m 2. 

DESCRIPTION 

Perm 2 is a general purpose two dimensional mapping f unction which can implement any into 
mapping. It is designed for a parallel computer architecture and uses a heuristic approach to 
reduce the execution time. 

Permls is a version of perm2 designed for serial computers. The transformation implemented 
by perm2 and perm2s is as follows: 


perm2[i,j] > mxMvjlckjl 

That is, the r and c matrices contain the row and column indices, respectively, of where each 
element in perm2 is to be obtained from in mx. 

AUTHOR 

A. P. Reeves 

SEE ALSO 

mat(2) 
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NAME 

prodm — masked prod reduction functions 
SYNOPSIS 

{ Slibrary rednce.pl } 

function prodm lC arg : veer, mask: vec ): btype; extern; 
function prodm 2( arg : matr, mask: mat ): btype; extern; 
function prodm3( arg : arr3r, mask: arr3 ): btype; extern; 
function prodm4( arg : arr4r, mask: arr4 ): btype; extern; 
function prodm5( arg : arr5r, mask: arr5 ): btype; extern; 

TYPES 

vec: a vector of type boolean 
mat: a matrix of type boolean 
arr3: a three dimensional array of type boolean 
arr4: a four dimensional array of type boolean 
arr5: a five dimensional array of type boolean 
veer: a vector of btype 
matn a matrix of btype 
arr3r: a three dimensional array of btype 
arr4r: a four dim ensional array of btype 
arr5r: a five dimensional array of btype 
where btype is a numeric base type 

DESCRIPTION 

These functions reduce the first argument where there are true elements in the boolean mask 
second argument. All dimensions are automatically reduced and the final result is always a 
scalar value. The range of the dimensions of arg and mask must match. The following func- 
tions are available: 

prodml 

prod reduction for vector arguments 

prodm2 

prod reduction for matrix arguments 

prodm3 

prod reduction for three dimensional arrays 

prodm4 

prod reduction for four dimensional arrays 

prodm5 

prod reduction for five dimensional arrays 
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NAME 

pyr ami d bpyr— pyramid convolution functions 
SYNOPSIS 

{$library pyramid.pl } 

function pyramid (imag ery t y p e; weight s rw t y p e l);r typ e; 
extern nrows ncols type; 

function bpyr(image;btype; weights: w type 2): btype; 
extern nrows ncols; 

function pyrmskg( idl: itype; id2titype):btype extern 0, nrows, 0, ncols; 
function pyrgen( idltitype; id2;itype; uptboolean; dinuinteger)titype; 
extern 0, nrows, 0, ncols; 

TYPES 

gtype - parallel array [0-nrows, 0-ncols] of (real or integer); 
wtypel - parallel array [0-13] of (real or integer); 
btype - parallel array [0-nrows, 0-ncols] of boolean; 
itype - parallel array [0-nrows, 0-ncols] of integer; 
wtype2 =■ parallel array [0-13] of boolean; 

EXTERN CONSTANTS 

nrows =* The largest row number of the image matrix 

ncols = The largest column number of the image matrix 

type = The data type of the weights vector (Le the word integer) 

VARS 

idl,id2,upl,dnl,up2,dn2: itype 
pyrmsk: btype; 

Idl and id.2 are two global index identifying matrices as created by twodid (see mat(2)). 
These must be iniatailized first. The other matrices specify transformations for managing 
pyramid data; see the description section for their initialization. 

DESCRIPTION 

The functions pyramid and bpyr are convolution functions for pyramid structure images 
stored in a two dimensional matrix. The successive levels of the pyramid structure are stored 
in successive rows of the ima ge matrix with the lowest level image (l pixel image) being 
located at position [0,0] in the input matrix. Pyramid is designed for use with mnumeric data 
while bpyr is suitable for boolean data. 

The pyr ami d operations require several constant matrices; these are declared globally for 
efficiency. The global variables are generated with the functions twodid (see mat(2)), pyrmskg 
which generates the boolean pyr ami d constraint mask, and pyrgen which generates all shift 
matrices used for moving up and down the pyramid. A typical program code for setting up 
these matrices is dhown below; 

twodid( idl, id2); 
pyrmsk := pyrmskgOdl, id2); 
upl > pyrgen (idl, id2, true, l); 
dnl := pyrgen (idl, id2, false, l); 
up2 := pyrgen (idl, id2, true, 2); 
dn2 :=* pyrgen (idl, id2, false, 2); 
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The weight vector for the pyr ami d convolution is organized as follows: 
w[0] parent 
w{l] w{2] w{3] 

w(4] w{5] w {6] same plane near neighbors 
w(7] w{8] w{9] 

w{lO] w{ll] children 
w{12] w{l3j 

The gather library function may be used to do explicit pyramid manipulation. An upward 
shif t of all levels of the pyr ami d can be achieved with: 

pyr > gather (gather ( pyr, upl, l), up2, 2>, 

A downward shift of all levels of the pyramid can be achieved with: 

pyr >■ gather (gather ( pyr, dnl, l), dn2, 2>, 

Horizontal shifting of all levels of the pyramid can be achieved with the library function 
xshift with the mask pyrmsk as follows: 

pyr > xshift (pyr, x, y, pyrmsk); 


SEE ALSO 

xshift(2), conv(2), xconv(2), varshift(2),mat(2) 
AUTHOR 

Gary Ross and A. P. Reeves 
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NAME 

spread, gather — variable shif t functions 

SYNOPSIS 

{$library sh.ift.pl } 

function spreadCoiigritype; shmaskantype; dimrintegerhitype; 
extern nrows, ncols; 

function gather(origtitype; shmaskantype; dimtintegerMtype; 
extern nro'ws, ncols; 

TYPES 

itype » parallel array [CLnrows, 0_ncols] of (integer or real); 
mtype =* parallel array [O-arows, CLncols] of boolean; 

EXTERN CONSTANTS 

nrows - the highest row number of the input matrix 
ncols the highest col umn number of the input matrix 

VARS 

idl,id2: itype; 

Gather accesses idl and id2 which are two global index identifying matrices as created by 
twodid (see mat(2)). These must be initialized before using gather; gather does not change 
their values. 

DESCRIPTION 

The functions spread and gather are both two dim ensional variable shift functions. Given a 
two dimensional input matrix, a shift mask, and a direction these functions will shift each ele- 
ment of the input array an amount specified by the shift mask. 

Shifting can only be done in one direction at a time, therefore the direction must be specified 
by the value of the parameter dim. 

The difference between the two functions is in how the shift mask (shmask) specifies where to 
shift each element. For the function spread, the shift mask indicates how far the correspond- 
ing element in the input matrix should be moved. Both positive and negative values are 
allowed for the shift mask. For the function gather, the shift mask defines where the result 
at the corresponding location expects to get its value from (Le. the row number or the column 
number, depending on the direction specified). 

CAUTIONS 

In the function spread it is possible for more than one element to be shifted to the same loca- 
tion. If this occurs the element that is shifted the farthest will be the final result. (Note the 
shift mask is converted to be all positive values and the shifts performed are actually rotates. 
Thus a shift value of -1 is actually shifted farther than a shift value of +1) 

Just as likely in the function spread is that nothing is shifted to a given location in the result. 
In this instance the resulting value will be zero. 

In the function gather, the s hmas k should consist of non-negative values only. A negative 
value in the shift mask indicates that the resulting value at the corresponding position in the 
result will be zero. 

AUTHOR 

Gary Ross 
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NAME 

summ — masked sum reduction functions 
SYNOPSIS 

{ $library reduce.pl } 

function summl( arg : veer, mask; vec ); btype; extern; 
function summ2( arg : matr, mask: mat ): btype; extern; 
function summ3( arg : arr3r, mask; arr3 ): btype; extern; 
function summ4( arg : arr4r, mask; arr4 ); btype; extern; 
function summ5( arg : arr5r, mask; arr5 ): btype; extern; 

TYPES 

vec a vector of type boolean 
mat: a matrix of type boolean 
arr3: a three dimensional array of type boolean 
arr4: a four dimensional array of type boolean 
arr5: a five dimensional array of type boolean 
veer: a vector of btype 
matn a matrix of btype 
arr3n a three dimensional array of btype 
arr4r: a four dimensional array of btype 
arr5n a five dim ensional array of btype 
where btype is a numeric base type 

DESCRIPTION 

These functions reduce the first argument where there are true elements in the boolean mask 
second argument. All dimensions are automatically reduced and the final result is always a 
scalar value. The range of the dimensions of arg and mask must match. The following func- 
tions are available: 

summl 

sum reduction for vector arguments 

summ2 

sum reduction for matrix arguments 

sum m3 

sum reduction for three dim ensional arrays 

summ4 

sum reduction for four dimensional arrays 

summ5 

sum reduction for five dim ensional arrays 
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NAME 

writemx,twodid,twodids — general matrix functions 

SYNOPSIS 

{Slibrary mat.pl } 

procedure wTitemx(mx^type;fmtrinteger); extern lol. Ml; 

procedure twodidCvar idl.*ptype; var id2:ptype); extern lol,hil,lo2^ii2; 
procedure twodids(var idliptype; var id2rptype); extern lol,M14o2,hi2; 

TYPES 

ptype - array [lol-MlJo2_hi2] of ntype; 

Where ntype is a numeric base type; real, integer or subrange. 

DESCRIPTION 

Writemx is a convenient procedure for printing the contents of a small numeric matrix. The 
fmt parameter specifies the field width for each element to be printed. Writemx is not clever 
enough to know when the width limit on an output line has been exceeded- 

Twodid generates two integer matrices idl and id2 which can be used for identifying each 
element in a parallel array operation. The contents of these arrays are defined as follows: 

idl[i,j] = i id2{i,j] =■ j 

T wodids is a serial version of twodid. 

AUTHOR 

A. P. Reeves 
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NAME 

xconv — constrained matrix convolution function 
SYNOPSIS 

{$library convolve.pl } 

function xconv(image;gtype^t:kernelpnaskantype):gt7pe; 
extern size; 

TYPES 

gtype = parallel array [CLx, 0_y] of (real or integer); 
kernel= parallel array [0_size-l, 0_size-l] of (real or integer); 
mtype =* parallel array [0_x, 0_y] of boolean; 

EXTERN CONSTANT 

size : integer constant specifying the size of the kernel. 

( Le. for a 3 x 3 kernel size=3 ) 

DESCRIPTION 

Xconv is a version of matrix convolution where pseudo-edges are defined internally to the 
image matrix by use of a boolean mask matrix. The definition of edges internal to the image 
matrix prevents data from crossing the defined boundaries. 

SEE ALSO 

xshift(2), conv(2) 

AUTHOR 

Gary Ross 
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NAME 

xshift — constrained shift function 

SYNOPSIS 

{Slibrary shift } 

function xshift(origri.type; x,y: integer; maskantypeMtype; 

TYPES 

itype =■ parallel array [CLx, 0_y] of (integer or real); 
mtype = parallel array [0-x, 0_y] of boolean; 

DESCRIPTION 

Xshift is a two dimensional shift function where the boolean matrix mask defines pseudo- 
edges insi de the array being shifted. The mask plane should be set to true only in those places 
where the user wishes to define an lower edge or a right edge. This definition introduces a 
non-symmetrical response in doing a shif t. When shif ting down or to the right the effect is as 
if an edge existed where defined in the mask. When shifting up the effect is the same as hav- 
ing an edge one row down from where it is defined in the mask. Also when shifting left, the 
effect is as if there were an edge one column to the right of where it is defined in the mask. 

NOTE 

It may be useful to think of the edge mask as defining an edge which is half a pixel down and 
half a pixel to the right of where it is defined in the edge mask. 

AUTHOR 

Gary Ross 
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LIBRARY FUNCTION SUBJECT INDEX 


General Utilities 

ceiling - round up to integer value 

matrancLrandinit - matrix random number generator 
writemx,twodid,twodids- general matrix functions 


Masked Reduction Functions 
allm - masked all reduction functions 

anym. - masked any reduction functions 

maxm - masked max reduction functions 

minm - masked min reduction functions 

prodm - masked prod reduction functions 

summ - masked sum reduction functions 


Large Array Utilities 


crshift^rrotate 

crshiftg,crrotateg 

Ishiftdrotate 

lsiiiftgdrotateg 


- shift a large crinkled array on a parallel computer 

- shift a large crinkled array on a parallel computer 

- shif t a large array on a parallel computer 

- shift a large array on a parallel computer 


Pennutaion Functions 

perm2,perm2s - General two dimensional mapping function 
mperm2 - Modified two dimensional mapping procedure 

lperm2 - permute data in a large array on a parallel computer 

spread, gather - variable shift functions 

xshift - constrained shift function 


Matrix Rotation Functions 


irotatejxrotate 

blint 

cint 

lirotatejnrotate 

lblint 

lcint 


- Rotation matrix generators 

- Bilinear interpolation procedure for a matrix 

- Cubic interpolation procedure for a matrix 

- Rotation matrix generators for large matrices 

- Bilinear interpolation procedure for a large matrix 

- Cubic interpolation procedure for a large matrix 


Near Neighbor and Convolution Functions 
compn - near neigbor comparison function 

conv,convg - matrix convolution functions 

iconv, rconv, bconv - matrix convolution functions 
mx - input values into a square matrix 

nearand, nearor, andnn, omn- near neighbor logical functions 
nnmnetrecur - BASE assembly language functions 
pyramid, bpyr - pyramid convolution functions 
xconv - constrained matrix convolution function 



